IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Data-Driven Business Intelligence Systems: Part I
Week 5
Dr. Jocelyn San Pedro
School of Information Management & Systems
Monash University

Lecture Outline
- Data-driven BIS
- Data warehouse
- Data warehouse architectures
- Entity-Relationship Modelling
- Multi-dimensional Modelling
- Star Schema
- An Example: Retail Trading

Learning Objectives
At the end of this lecture, students will
- have a better understanding of the concepts, tools and technology underlying data-driven business intelligence systems
- have knowledge of multidimensional modelling and the star schema for modelling data warehouses

Data-Driven BIS
Data-driven BIS are information systems that provide BI through access to and manipulation of large databases of structured data.
They include tools for
- “drill down” for more detailed information
- “drill up” for a broader, more summarised view
- “slice and dice” for a change in data dimensions
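
As a rough illustration of the three operations, the sketch below totals a small set of sales records at different levels of detail. The records, field names and the summarise helper are all hypothetical and are not taken from any particular BI tool.

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, region, amount).
SALES = [
    (2003, "Q1", "North", 120.0),
    (2003, "Q1", "South", 80.0),
    (2003, "Q2", "North", 150.0),
    (2003, "Q2", "South", 90.0),
    (2004, "Q1", "North", 130.0),
    (2004, "Q1", "South", 70.0),
]

FIELDS = ("year", "quarter", "region")


def summarise(records, by):
    """Total the amount for each combination of the fields named in `by`."""
    positions = [FIELDS.index(name) for name in by]
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[i] for i in positions)
        totals[key] += record[3]
    return dict(totals)


# Drill up: a broad, summarised view by year.
by_year = summarise(SALES, ["year"])
# Drill down: the same totals broken out by quarter within each year.
by_year_quarter = summarise(SALES, ["year", "quarter"])
# Slice and dice: look at the same data along a different dimension.
by_region = summarise(SALES, ["region"])
```

Drilling down only adds fields to the grouping key; drilling up removes them. Real tools typically do the same against pre-built aggregates rather than raw records.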

Data-Driven BIS
Figure: “Slicing” the cube (axes: Time, Salesperson, Product). Sales totals by salesperson:

Salesperson | Sales
Leverling   | $201,196
Davolio     | $182,500
Peacock     | $225,764
Fuller      | $162,504
Dodsworth   | $75,048
King        | $116,963
Suyama      | $72,528
Callahan    | $123,033
Buchanan    | $68,792

Data-Driven BIS
Figure: “Dicing” the cube. The salesperson totals (Peacock, Leverling, Buchanan, Suyama, King, Dodsworth, Callahan, Davolio, Fuller) are broken down by quarter, Q1 to Q4; the quarterly cells shown are $22,719, $6,858, $16,035 and $23,181.

Data Warehouse
“A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision making process” – Bill Inmon (1995)
- Subject-oriented: the focus is on subjects related to business or organisational activity, such as customers, employees and suppliers, instead of on applications (finance, marketing, production)
- Integrated: data from various databases is stored in a consistent format through the use of naming conventions, domain constraints, physical attributes and measurements
- Time-variant: data is associated with specific points in time
- Nonvolatile: data does not change once it is loaded and stored in the data warehouse

Data Warehouse
- “A data warehouse is a copy of transaction data specifically structured for query and analysis” – Ralph Kimball (1996)
- “A data warehouse is a specific database designed and populated to provide decision support in an organisation” – Gray and Watson (1998)

Data Warehouse
Data warehousing emerged as a result of
- improvements in database technology: the relational data model and relational database management systems (DBMS)
- advances in computer hardware: the emergence of affordable mass storage and parallel computer architectures
- the emergence of end-user computing, facilitated by powerful, intuitive computer interfaces and tools
- advances in middleware products that enable enterprise database connectivity across heterogeneous platforms

Data Warehouse
Data warehousing was triggered by recognition of the fundamental differences between operational (or production) systems and informational (or decision support) systems.
- Operational systems: used to run a business in real time, based on current data, e.g. sales order processing, reservation systems, patient registration
- Informational systems: designed to support decision making based on stable point-in-time or historical data, for complex read-only queries or data mining applications, e.g. sales trend analysis, customer segmentation, human resources planning

Data Warehouse
Comparison of operational and informational systems (McFadden, Hoffer and Prescott 1999)

Characteristic  | Operational systems                              | Informational systems
Primary purpose | Run the business on a current basis              | Support managerial decision making
Type of data    | Current representation of state of the business | Historical or point-in-time (snapshots)
Primary users   | Clerks, salespersons, administrators             | Managers, business analysts, customers
Scope of usage  | Narrow; simple updates and queries               | Broad; complex queries and analysis
Design goal     | Performance                                      | Ease of access and use

Data Warehouse Architectures
Generic two-level architecture
Figure: four sources (one file, three databases) feed a Transformation and Integration step, which loads the data warehouse.

Data Warehouse Architectures
Three-level architecture
Figure: four sources (one file, three databases) feed a Transformation and Integration step, which loads the enterprise data warehouse; a Selection and Aggregation step then fills two data marts.

Data Warehouse Architectures
Data mart
- a data warehouse that is limited in scope
- contains selected and summarised data to support the specific decision support applications of a specific end-user group
- e.g., marketing data mart, finance data mart

Data Warehouse Architectures
Three-layer data architecture
Figure: the enterprise data model alongside three layers, each with its own metadata:
- Operational systems: operational data, operational metadata
- Enterprise data warehouse: reconciled data, EDW metadata
- Data mart: derived data, data mart metadata

Data Warehouse Architectures
Enterprise data model
- Presents a total picture explaining the data required by an organisation
- Must be developed prior to designing a data warehouse
- Entity-Relationship models: the traditional approach in relational database design
- Multidimensional models: commonly used in data warehouses and data marts for faster retrieval in querying and analysis

Data Warehouse Architectures
Operational data
- current or transient, not historical
- restricted in scope to a particular application
- poor quality
- not normalised (the data relations contain multi-valued attributes or repeating groups, partial dependencies, or transitive dependencies)

Data Warehouse Architectures
Figure: sample operational data from the Northwind database

Data Warehouse Architectures
Reconciled data
- Detailed rather than summarised
- Historical: periodic snapshots
- Comprehensive: should reflect an enterprise-wide perspective and conform to the enterprise data model
- Quality controlled
- Normalised: 3NF or higher (3NF: no multi-valued attributes, no partial dependencies, no transitive dependencies)

Data Warehouse Architectures
Steps in Normalisation
1. Start with a table with multi-valued attributes
2. Remove multi-valued attributes -> 1st Normal Form
3. Remove partial dependencies -> 2nd Normal Form
4. Remove transitive dependencies -> 3rd Normal Form

Data Warehouse Architectures
Figure: normalising the Sales relation
- Sales relation (shown with sample data): Cust_ID, Name, Salesperson, Region
- Relations in 3NF: (Cust_ID, Name, Salesperson) and (Salesperson, Region)
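
The decomposition of the Sales relation can be sketched in a few lines. The rows below are hypothetical stand-ins for the sample data on the slide; the sketch assumes the transitive dependency is Salesperson -> Region and checks that it holds before splitting the relation.

```python
# Hypothetical rows of the Sales relation: (Cust_ID, Name, Salesperson, Region).
SALES = [
    (101, "Ali", "Smith", "South"),
    (102, "Bell", "Hicks", "West"),
    (103, "Chan", "Smith", "South"),
    (104, "Diaz", "Lopez", "East"),
]


def decompose(rows):
    """Split Sales into (Cust_ID, Name, Salesperson) and Salesperson -> Region."""
    customers = [(cust_id, name, salesperson) for cust_id, name, salesperson, _ in rows]
    regions = {}
    for _, _, salesperson, region in rows:
        # Assumed dependency: each salesperson works in exactly one region.
        if regions.setdefault(salesperson, region) != region:
            raise ValueError(f"{salesperson} appears with more than one region")
    return customers, regions


def rejoin(customers, regions):
    """Rebuild the original relation by joining on Salesperson."""
    return [(cust_id, name, sp, regions[sp]) for cust_id, name, sp in customers]


customers, regions = decompose(SALES)
```

After the split, each salesperson's region is stored once, so changing it is a single update, and joining the two relations gives back the original rows.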

Data Warehouse Architectures
Derived data
- selected, formatted, aggregated
- provides ease of use for decision support applications
- provides fast response for user queries
- supports ad-hoc queries and data mining applications
- the data model commonly used is the star schema

Data Warehouse Architectures
Metadata: data that describe the properties or characteristics of other data
- Operational metadata: describe the data in the various operational systems (as well as external data) that feed the EDW
- EDW metadata: describe the reconciled data layer as well as the rules for transforming operational data to reconciled data
- Data mart metadata: describe the data in the derived data layer and the rules for transforming reconciled data to derived data

Data Warehouse Architectures
Figure: sample data description

Data Warehouse Architectures
Data Reconciliation Process
- Stage 1: initial load, when the EDW is first created
- Stage 2: subsequent updates
Steps in the data reconciliation process
- Capture: extract relevant data from the source(s)

Data Warehouse Architectures
- Scrub: clean or upgrade the quality of raw data before transformation and loading (using pattern recognition and artificial intelligence techniques)
  - Track and correct errors: misspelled names, erroneous birthdates, missing data, inconsistent data formats
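
A minimal sketch of a scrub step follows. It uses simple hand-written rules rather than the pattern recognition or AI techniques mentioned above, and the record layout, the list of accepted date formats and the function names are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed date formats found in the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return the date in ISO format, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def scrub(record):
    """Return a cleaned copy of the record plus a list of problems to track."""
    problems = []
    # Collapse stray whitespace and normalise capitalisation of the name.
    name = " ".join((record.get("name") or "").split()).title()
    if not name:
        problems.append("missing name")
    # Convert the birthdate to one consistent format; reject impossible dates.
    birthdate = parse_date(record.get("birthdate") or "")
    if birthdate is None:
        problems.append("bad or missing birthdate")
    return {"name": name, "birthdate": birthdate}, problems
```

Records that come back with a non-empty problem list would be routed for correction instead of being loaded as they are.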

Data Warehouse Architectures
- Transform: includes
  - converting data format or representation from source to target system
  - partitioning data according to predefined criteria
  - aggregating data from detailed to summary level

Data Warehouse Architectures
- Load and Index
  - Refresh mode: fill the EDW by bulk rewriting of target data; at periodic intervals the data warehouse is rewritten, replacing previous contents
  - Update mode: only changes in source data are written to the data warehouse, without overwriting or deleting previous contents
  - Create the necessary indexes
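
The two load modes can be contrasted with a small sketch. Here the warehouse is just a Python list of row dictionaries; the function names and the load_date column are hypothetical. Refresh mode is treated as a bulk rewrite, and update mode as appending only changed rows while keeping the earlier ones.

```python
def refresh_load(warehouse, source_rows):
    """Refresh mode: rewrite the target in bulk, replacing previous contents."""
    warehouse.clear()
    warehouse.extend(source_rows)


def update_load(warehouse, changed_rows, load_date):
    """Update mode: append only the changed rows, stamped with the load date.

    Earlier rows are neither overwritten nor deleted, so history is kept.
    """
    for row in changed_rows:
        warehouse.append({**row, "load_date": load_date})


def build_index(warehouse, key):
    """Create a simple index: key value -> positions of matching rows."""
    index = {}
    for position, row in enumerate(warehouse):
        index.setdefault(row[key], []).append(position)
    return index


warehouse = []
refresh_load(warehouse, [{"sku": 1, "price": 2.0, "load_date": "2004-01-01"}])
update_load(warehouse, [{"sku": 1, "price": 2.5}], "2004-02-01")
sku_index = build_index(warehouse, "sku")
```

After the update, both prices for SKU 1 are present, each with its own load date, which is what makes the warehouse time-variant.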

Entity-Relationship Modelling
Figure: ER diagram with entities Sale, Period, Product, Store, Customer, Region, Customer Type and Product Type, linked by relationships labelled "groups in", "groups within", "within", "contains", "makes" and "located at"
(based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)

Entity-Relationship Modelling
- Entities, attributes and relationships
- Rules of normalisation
  - 3NF is typical
  - Protection of integrity of the database by avoiding anomalies
  - Every logical thing is represented only once
- Separate consideration of logical and physical aspects

Entity-Relationship Modelling
Figure: ER model for the Northwind sample database

Entity-Relationship Modelling
- Large numbers of tables
  - Oracle Financials - 1,800; SAP 7 up to 8,000
- Commonly used
  - Feels natural once you get used to it
- Research shows that ER models are not easily understood by IT people
  - Especially concepts like abstraction, generalisation, sub-types, etc.

Multi-dimensional Modelling
- It is possible to conceptualise data as multi-dimensional
  - Difficult to design
  - Easy to use resulting reports
- Advocated by Ralph Kimball (see his manifesto, and a rebuttal, available on the web site)
- A logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access

Multi-dimensional Modelling
- An approach to database design that provides a database that is easy to understand and navigate
- The aim is to encourage understanding, exploration and learning
- Each number in a database has a set of associated attributes
  - What it measures, what point in time it was created, what location it’s from, what product it’s associated with, what promotion, etc.
  - This makes the number meaningful.

Multi-dimensional Modelling
- Each attribute associated with each number represents a dimension
  - Measure, time, location, product, etc.
- Resulting views are easy to navigate and move around
  - Slice and dice
  - Report template

Multi-dimensional Modelling
Example Widget Sales ($Million)

One Dimension (State):

State | Vic  | NSW  | QLD  | WA   | SA   | TAS
Sales | 43.6 | 53.4 | 31.4 | 27.5 | 28.3 | 14.7

Two Dimensions (location x time):

Year | Vic  | NSW  | QLD  | WA   | SA   | TAS
1998 | 43.6 | 53.4 | 31.4 | 27.5 | 28.3 | 14.7
1999 | 46.2 | 52.1 | 29.6 | 25.1 | 27.1 | 18.2
2000 | 56.3 | 62.3 | 35.1 | 29.4 | 21.5 | 13.3
2001 | 50.1 | 57.2 | 33.6 | 28.1 | 22.5 | 16.3
2002 | 48.2 | 53.4 | 31.4 | 28.4 | 25.1 | 15.4

Multi-dimensional Modelling
Three Dimensions (location x time x product):
Figure: a cube with axes State (Vic, NSW, QLD, WA, SA, TAS), Year (1998 to 2002) and Product (Widgets, Sprockets, Gaskets, Flanges); the visible face repeats the state-by-year sales figures of the two-dimensional example.

Multi-dimensional Modelling
- We usually talk about information spaces as cubes, hyper-cubes or n-cubes
- Resulting views of databases are easy to navigate and move around
  - Slicing and dicing
  - Report template

Multi-dimensional Modelling
Slicing and Dicing
- Select certain dimension values to examine a set of data
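
Selecting dimension values can be sketched as a filter over the cells of a cube. The cube below is hypothetical (its figures are not the widget sales shown earlier), and the select helper is an illustrative name, not part of any OLAP product.

```python
# Hypothetical cube: (product, state, year) -> sales in $ million.
CUBE = {
    ("Widgets", "Vic", 2001): 10, ("Widgets", "Vic", 2002): 12,
    ("Widgets", "NSW", 2001): 20, ("Widgets", "NSW", 2002): 22,
    ("Gaskets", "Vic", 2001): 5, ("Gaskets", "Vic", 2002): 6,
    ("Gaskets", "NSW", 2001): 7, ("Gaskets", "NSW", 2002): 8,
}
DIMS = ("product", "state", "year")


def select(cube, **wanted):
    """Keep the cells whose coordinates are among the wanted values.

    Each keyword names a dimension and lists the values to keep on it;
    dimensions that are not named are left unrestricted.
    """
    allowed = {DIMS.index(dim): set(values) for dim, values in wanted.items()}
    return {
        cell: value
        for cell, value in cube.items()
        if all(cell[i] in values for i, values in allowed.items())
    }


# Slice: fix a single value on one dimension (all products and states in 2001).
slice_2001 = select(CUBE, year=[2001])
# Dice: restrict several dimensions at once to get a smaller sub-cube.
dice = select(CUBE, product=["Widgets"], state=["Vic", "NSW"], year=[2002])
```

A slice keeps every cell on the other dimensions, while a dice narrows several dimensions together.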

Multi-dimensional Modelling
Report Templates
- One template is produced for a set of slices
- Data changes, layout doesn’t
Figure: report template with Location and Year drop-down boxes and a bar chart titled "Product Sales: Victoria, 2001" (sales axis from 0 to 50; bars for Widgets, Sprockets, Flanges and Gaskets)

Multi-dimensional Modelling
From Traditional Relational to Multi-dimensional
Figure (from Pilot Software OLAP White Paper): a typical relational database table, and the same data displayed in two dimensions.
Easy! (The key is to identify the continuous and discrete variables in the flat file.)
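
One way to picture the conversion: the discrete variables of the flat file become the row and column dimensions, and the continuous variable becomes the cell value. The flat records and the pivot helper below are hypothetical and are not taken from the white paper.

```python
# Hypothetical flat file: one record per sale, as in a relational table.
FLAT = [
    {"state": "Vic", "year": 2001, "sales": 10.0},
    {"state": "Vic", "year": 2002, "sales": 12.0},
    {"state": "NSW", "year": 2001, "sales": 20.0},
    {"state": "NSW", "year": 2001, "sales": 1.5},
]


def pivot(rows, row_dim, col_dim, measure):
    """Lay the flat records out as a two-dimensional table.

    row_dim and col_dim name discrete variables (the dimensions);
    measure names the continuous variable that is summed into each cell.
    """
    table = {}
    for record in rows:
        cells = table.setdefault(record[row_dim], {})
        column = record[col_dim]
        cells[column] = cells.get(column, 0.0) + record[measure]
    return table


by_year_and_state = pivot(FLAT, "year", "state", "sales")
```

Here state and year are discrete, so they form the axes, while sales is continuous and fills the cells; records that fall in the same cell are added together.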

Star Schema
- Used to implement dimensional analysis using standard relational database technology
- Very common in data warehousing
- Many variations
- Two components:
  - Fact table: contains measurements of the business, e.g. sales, purchase orders, shipments
  - Dimension tables: store the textual descriptions of the dimensions of the business, e.g. product, customer, vendor, store

Star Schema
- Fact tables store the hard data
- Dimension tables store all the information about our dimensions
- The fact table has a many-to-one relationship with each dimension table
- Each dimension table has a primary key that appears as a foreign key in the fact table, whose primary key is a concatenation of all of the foreign keys

Star Schema
Dimension tables in star schemas are denormalised, resulting in:
- Fewer tables
- Simpler for users to navigate
- Reduced number of complex multi-join tables

Star schema
Figure: a Sale fact table surrounded by four dimension tables
- Sale (fact table): Time key, Store key, Customer key, Product key, Dollar sales, Unit sales
- Customer: Customer key, Name, Customer type
- Product: Product key, Product type, Weight
- Store: Store key, Address, Region
- Time: Time key, Day, Month
Legend: Primary key, Foreign key
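
The star schema in the figure can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column types, the table name time_dim (used in place of Time) and all the inserted rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month TEXT);
CREATE TABLE store    (store_key INTEGER PRIMARY KEY, address TEXT, region TEXT);
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, customer_type TEXT);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, product_type TEXT, weight REAL);
-- Fact table: its primary key is the concatenation of the four foreign keys.
CREATE TABLE sale (
    time_key     INTEGER REFERENCES time_dim(time_key),
    store_key    INTEGER REFERENCES store(store_key),
    customer_key INTEGER REFERENCES customer(customer_key),
    product_key  INTEGER REFERENCES product(product_key),
    dollar_sales REAL,
    unit_sales   INTEGER,
    PRIMARY KEY (time_key, store_key, customer_key, product_key)
);
""")

# Hypothetical dimension rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)", [(1, 1, "Jan"), (2, 1, "Feb")])
conn.executemany("INSERT INTO store VALUES (?, ?, ?)",
                 [(1, "1 Main St", "North"), (2, "9 High St", "South")])
conn.execute("INSERT INTO customer VALUES (1, 'Acme', 'Wholesale')")
conn.execute("INSERT INTO product VALUES (1, 'Widget', 0.5)")
# Hypothetical fact rows: many facts point at each dimension row.
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 100.0, 10), (1, 2, 1, 1, 50.0, 5), (2, 1, 1, 1, 70.0, 7)])

# A typical star query: join the fact table to the dimensions it needs,
# then group by descriptive attributes taken from those dimensions.
rows = conn.execute("""
    SELECT s.region, t.month, SUM(f.dollar_sales)
    FROM sale AS f
    JOIN store AS s ON s.store_key = f.store_key
    JOIN time_dim AS t ON t.time_key = f.time_key
    GROUP BY s.region, t.month
    ORDER BY s.region, t.month
""").fetchall()
```

Every query follows the same shape: one fact table in the middle, joined directly to whichever dimension tables supply the grouping or filtering attributes.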

Snowflake schema
Figure: the Sale fact table linked to the Time, Product, Store and Customer dimensions, with Product Type, Customer Type and Region split out as further tables.
“Do not snowflake your dimensions, even if very large. If you do snowflake your dimensions, prepare to live with poor performance” – Kimball (1996)

Star Schema
- Dimensions can be shared amongst fact tables.

Star Schema
- ER schemas are useful for data mapping to legacy systems and for integration of the data warehouse
- Star schemas are useful for the design of warehouse databases as they are efficient and easy to understand and use
  - They allow relational databases to support multi-dimensional data cubes

Star Schema
Steps in the design process (Kimball (1996))
1. Choose a business process
2. Choose the grain of the fact table
   - Too fine -> oversized database
   - Too coarse -> loss of meaningful information
3. Choose the dimensions
4. Choose the measured facts (usually numeric, additive quantities)
5. Complete the dimension tables

Extra steps in the design process (Kimball (1996))
6. Determine the strategy for slowly changing dimensions
7. Create aggregations and other physical storage components
8. Determine the historical duration of the database
9. Determine the urgency with which the data is to be extracted and loaded into the data warehouse

An Example: Retail Trading
- A large grocery chain with approx. 500 stores
- Each store has approx. 60,000 products on shelves
- Need to maximise profit and keep shelves stocked
- Important decisions concern pricing and promotion
- Promotion types are:
  - Temporary price reductions
  - Newspaper advertisements
  - Shelf and end-aisle displays
  - Coupons

An Example: Retail Trading
1. Choose a business process
   - Daily Item Movement
2. Choose the grain of the fact table
   - Stock Keeping Unit (SKU) by store by promotion by day
3. Choose the dimensions
   - Time, product, store and promotion
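
To make the chosen grain concrete, the sketch below rolls hypothetical checkout line items up to one fact row per SKU, store, promotion and day. The input layout and the to_fact_rows name are assumptions; the measures are limited to unit and dollar sales.

```python
from collections import defaultdict

# Hypothetical checkout line items: (day, sku, store, promotion, units, dollars).
LINE_ITEMS = [
    ("2004-03-01", "SKU-1", "S01", "none", 2, 5.00),
    ("2004-03-01", "SKU-1", "S01", "none", 1, 2.50),
    ("2004-03-01", "SKU-1", "S01", "coupon", 1, 2.00),
    ("2004-03-02", "SKU-1", "S01", "none", 3, 7.50),
]


def to_fact_rows(items):
    """Aggregate line items to the grain: SKU by store by promotion by day."""
    facts = defaultdict(lambda: {"unit_sales": 0, "dollar_sales": 0.0})
    for day, sku, store, promotion, units, dollars in items:
        row = facts[(day, sku, store, promotion)]
        # Both measures are additive, so they can simply be summed.
        row["unit_sales"] += units
        row["dollar_sales"] += dollars
    return dict(facts)


facts = to_fact_rows(LINE_ITEMS)
```

Individual checkout transactions are no longer distinguishable after this step, which is the information given up by not choosing a finer grain; in return, the fact table holds at most one row per SKU, store, promotion and day.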

An Example: Retail Trading
Figure: Retail Trading Dimensions
- Sale (fact table): Time key, Product key, Store key, Promotion key, facts (to be detailed next)
- Promotion: Promotion key, other promotion attributes
- Product: Product key, other product attributes
- Store: Store key, other store attributes
- Time: Time key, other time attributes

An Example: Retail Trading
4. Choose the measured facts
- Sale (fact table): Time key, Product key, Store key, Promotion key, Dollar Sales, Unit Sales, Dollar Costs, Customer Count
- Promotion: Promotion key, other promotion attributes
- Product: Product key, other product attributes
- Store: Store key, other store attributes
- Time: Time key, other time attributes

An Example: Retail Trading
5. Complete the dimension tables
- Sale (fact table): Time key, Product key, Store key, Promotion key, Dollar Sales, Unit Sales, Dollar Costs, Customer Count
- Promotion: Promotion key, other promotion attributes
- Product: Product key, SKU Description, SKU Number, Package Size, Brand, Sub Category, Department, Package Type, Diet Type, Weight, Weight unit of measure, Units per retail case, Units per ship case, Cases per pallet
- Store: Store key, other store attributes
- Time: Time key, other time attributes

References
- Inmon, W. H. (1996) Building the Data Warehouse (2nd ed), Wiley, NY.
- Kimball, R. (1996) The Data Warehouse Toolkit, Wiley, NY.
- McFadden, F., Hoffer, J. and Prescott, M. (1999) Modern Database Management, Addison-Wesley.

Questions?
[email protected]
School of Information Management and Systems, Monash University
T1.28, T Block, Caulfield Campus
9903 2735