IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1, 2004
Data-Driven Business Intelligence Systems: Part I
Week 5
Dr. Jocelyn San Pedro
School of Information Management & Systems, Monash University


Lecture Outline

• Data-driven BIS
• Data warehouse
• Data warehouse architectures
• Entity-Relationship Modelling
• Multi-dimensional Modelling
• Star Schema
• An Example: Retail Trading

Learning Objectives

At the end of this lecture, students will:
• Have a better understanding of the concepts, tools and technology underlying data-driven business intelligence systems
• Have knowledge of multidimensional modelling and the star schema as data models for data warehouses


Data-Driven Business Intelligence Systems

Data-Driven BIS

Data-driven BIS are information systems that provide BI through access to and manipulation of large databases of structured data. They include tools for:
• "drill down", for more detailed information
• "drill up", for a broader, more summarised view
• "slice and dice", for a change in data dimensions
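The three operations can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular BI tool: the rows, the field names and the helper functions `aggregate` and `slice_rows` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales).
FIELDS = ("year", "quarter", "region", "product", "sales")
ROWS = [
    (2003, "Q1", "North", "Widgets", 10),
    (2003, "Q1", "South", "Widgets", 20),
    (2003, "Q2", "North", "Gaskets", 30),
    (2003, "Q2", "South", "Widgets", 40),
    (2004, "Q1", "North", "Gaskets", 50),
]


def aggregate(rows, by):
    """Sum sales grouped by the named dimensions."""
    idx = [FIELDS.index(name) for name in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_rows(rows, **fixed):
    """Keep only the rows whose dimensions match the fixed values."""
    return [r for r in rows
            if all(r[FIELDS.index(k)] == v for k, v in fixed.items())]


# Drill up: a broad, summarised view by year.
by_year = aggregate(ROWS, ["year"])
# Drill down: the same totals at a more detailed level.
by_year_quarter = aggregate(ROWS, ["year", "quarter"])
# Slice and dice: fix one dimension, then view the data along another.
north_by_product = aggregate(slice_rows(ROWS, region="North"), ["product"])
```

Drilling up and down only changes the list of grouping dimensions; slicing fixes a value on one dimension before grouping.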


Data-Driven BIS

[Figure: "Slicing" the cube. Cube axes: Time, Salesperson, Product. The slice shows one sales figure per salesperson.]

Salesperson   Sales
Leverling     $201,196
Davolio       $182,500
Peacock       $225,764
Fuller        $162,504
Dodsworth     $75,048
King          $116,963
Suyama        $72,528
Callahan      $123,033
Buchanan      $68,792

Data-Driven BIS

[Figure: "Dicing" the cube. The salesperson figures (Peacock $225,764; Leverling $201,196; Davolio $182,500; Fuller $162,504; Callahan $123,033; King $116,963; Dodsworth $75,048; Suyama $72,528; Buchanan $68,792) are broken down by quarter (Q1 to Q4). The quarterly values visible are $22,719, $6,858, $16,035 and $23,181, which together appear to make up Buchanan's figure.]


Data Warehouse

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision making process – Bill Inmon (1995)

• Subject-oriented: the focus is on subjects related to business or organisational activity, such as customers, employees and suppliers, rather than on applications (finance, marketing, production)
• Integrated: data from various databases is stored in a consistent format through the use of naming conventions, domain constraints, physical attributes and measurements
• Time-variant: data is associated with specific points in time
• Nonvolatile: data does not change once it is stored in the data warehouse

Data Warehouse

• A data warehouse is a copy of transaction data specifically structured for query and analysis – Ralph Kimball (1996)
• A data warehouse is a specific database designed and populated to provide decision support in an organisation – Gray and Watson (1998)

Data Warehouse

Data warehousing emerged as a result of:
• improvements in database technology: the relational data model and relational database management systems (DBMS)
• advances in computer hardware: the emergence of affordable mass storage and parallel computer architectures
• the emergence of end-user computing, facilitated by powerful, intuitive computer interfaces and tools
• advances in middleware products that enable enterprise database connectivity across heterogeneous platforms

Data Warehouse

Data warehousing was triggered by recognition of the fundamental differences between operational (or production) systems and informational (or decision support) systems:
• Operational systems – used to run a business in real time, based on current data, e.g. sales order processing, reservation systems, patient registration
• Informational systems – designed to support decision making based on stable point-in-time or historical data; used for complex read-only queries or data mining applications, e.g. sales trend analysis, customer segmentation, human resources planning

Data Warehouse

Characteristic  | Operational Systems                                  | Informational Systems
Primary purpose | Run the business on a current basis                  | Support managerial decision making
Type of data    | Current representation of the state of the business | Historical or point-in-time (snapshots)
Primary users   | Clerks, salespersons, administrators                 | Managers, business analysts, customers
Scope of usage  | Narrow; simple updates and queries                   | Broad; complex queries and analysis
Design goal     | Performance                                          | Ease of access and use

Comparison of operational and informational systems – McFadden, Hoffer and Prescott (1999)

Data Warehouse Architectures

Generic two-level architecture

[Figure: four sources (one file and three databases) feed a "Transformation and Integration" step, which loads the data warehouse.]

Data Warehouse Architectures

Three-level architecture

[Figure: four sources (one file and three databases) feed a "Transformation and Integration" step, which loads the enterprise data warehouse; a "Selection and aggregation" step then populates the data marts.]

Data Warehouse Architectures

Data mart
• a data warehouse that is limited in scope
• contains selected and summarised data to support the specific decision support applications of a specific end-user group
• e.g., marketing data mart, finance data mart
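The "selected and summarised" idea can be sketched as a filter followed by an aggregation. The rows, column meanings and the function `build_data_mart` below are hypothetical and only illustrate the principle.

```python
from collections import defaultdict

# Hypothetical detailed warehouse rows: (department, region, month, amount).
WAREHOUSE = [
    ("marketing", "Vic", "2004-01", 120.0),
    ("marketing", "Vic", "2004-01", 80.0),
    ("marketing", "NSW", "2004-02", 50.0),
    ("finance", "Vic", "2004-01", 300.0),
]


def build_data_mart(rows, department):
    """Select one department's rows and summarise them by region and month."""
    summary = defaultdict(float)
    for dept, region, month, amount in rows:
        if dept == department:                  # selection
            summary[(region, month)] += amount  # aggregation
    return dict(summary)


marketing_mart = build_data_mart(WAREHOUSE, "marketing")
```

The mart holds fewer, coarser rows than the warehouse, which is what makes it suited to one user group's queries.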

Data Warehouse Architectures

Three-layer data architecture

[Figure: three data layers, each with its own metadata, all described by the enterprise data model.]

Data layer       | Metadata             | Held in
Derived data     | Data mart metadata   | Data mart
Reconciled data  | EDW metadata         | Enterprise data warehouse
Operational data | Operational metadata | Operational systems

Data Warehouse Architectures

Enterprise data model
• Presents a total picture explaining the data required by an organisation
• Must be developed prior to designing a data warehouse
• Entity-Relationship Models – the traditional approach in relational database design
• Multidimensional Models – commonly used in data warehouses and data marts for faster retrieval in querying and analysis

Data Warehouse Architectures

Operational Data
• current or transient, not historical
• restricted in scope to a particular application
• poor quality
• not normalised (the data relations contain multi-valued attributes or repeating groups, partial dependencies and transitive dependencies)

Data Warehouse Architectures

[Figure: Sample operational data from the Northwind database]

Data Warehouse Architectures

Reconciled Data
• Detailed – rather than summarised
• Historical – snapshots, periodic
• Comprehensive – should reflect an enterprise-wide perspective and conform to the enterprise data model
• Quality controlled
• Normalised – 3NF or higher
  • 3NF – no multi-valued attributes, no partial dependencies, no transitive dependencies

Data Warehouse Architectures

Steps in Normalisation

Table with multi-valued attributes
  → remove multi-valued attributes → 1st Normal Form
  → remove partial dependencies → 2nd Normal Form
  → remove transitive dependencies → 3rd Normal Form

Data Warehouse Architectures

[Figure: the Sales relation with sample data, Sales(Cust_ID, Name, Salesperson, Region), and the relations in 3NF: (Cust_ID, Name, Salesperson) and (Salesperson, Region).]
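The decomposition can be sketched as follows. The sample rows are hypothetical (the slide's own sample data is not recoverable), and `to_3nf` and `rejoin` are illustrative helper names. The sketch assumes Region depends on Salesperson, which is the transitive dependency being removed.

```python
# Hypothetical sample rows for Sales(Cust_ID, Name, Salesperson, Region).
# Region depends on Salesperson, not directly on Cust_ID.
SALES = [
    (8023, "Anderson", "Smith", "South"),
    (9167, "Bancroft", "Hicks", "West"),
    (7924, "Hobbs", "Smith", "South"),
]


def to_3nf(rows):
    """Split the relation so that Region is stored once per salesperson."""
    customers = [(cust_id, name, rep) for cust_id, name, rep, _ in rows]
    regions = {}
    for _, _, rep, region in rows:
        # A salesperson with two regions would contradict the dependency.
        if regions.setdefault(rep, region) != region:
            raise ValueError(f"{rep} appears with more than one region")
    return customers, regions


def rejoin(customers, regions):
    """Joining the two relations on Salesperson restores the original rows."""
    return [(c, n, rep, regions[rep]) for c, n, rep in customers]


customers, regions = to_3nf(SALES)
```

Because the join reproduces the original rows, nothing is lost by the split, and a salesperson's region can now be changed in one place.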

Data Warehouse Architectures

Derived Data
• selected, formatted, aggregated
• provides ease of use for decision support applications
• provides fast response for user queries
• supports ad-hoc queries and data mining applications
• the data model commonly used is the star schema

Data Warehouse Architectures

Metadata
• data that describe the properties or characteristics of other data
• Operational metadata – describe the data in the various operational systems (as well as external data) that feed the EDW
• EDW metadata – describe the reconciled data layer as well as the rules for transforming operational data to reconciled data
• Data mart metadata – describe the data in the derived data layer and the rules for transforming reconciled data to derived data

Data Warehouse Architectures

[Figures: Sample data description (two slides)]

Data Warehouse Architectures

Data Reconciliation Process
• Stage 1: Initial load, when the EDW is first created
• Stage 2: Subsequent updates

Steps in Data Reconciliation Process
• Capture – extract relevant data from the source(s)

Data Warehouse Architectures

• Scrub – clean or upgrade the quality of raw data before transformation and loading (using pattern recognition and artificial intelligence techniques)
  • Track and correct errors: misspelled names, erroneous birthdates, missing data, inconsistent data formats
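A scrubbing step might look like the sketch below. It uses simple hand-written rules rather than the pattern recognition or AI techniques mentioned above; the record layout, the `NAME_FIXES` lookup and the `scrub` function are all hypothetical.

```python
from datetime import date, datetime

# Hypothetical lookup of known misspellings.
NAME_FIXES = {"Jhon": "John", "Mray": "Mary"}


def scrub(record, today=date(2004, 4, 1)):
    """Return (cleaned_record, list_of_problems) for one raw record."""
    problems = []
    cleaned = dict(record)

    name = (record.get("name") or "").strip()
    if not name:
        problems.append("missing name")
    elif name in NAME_FIXES:
        problems.append(f"misspelled name: {name}")
        name = NAME_FIXES[name]
    cleaned["name"] = name

    born = None
    # Accept two source date formats and normalise both to a date object.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            born = datetime.strptime(record.get("birthdate") or "", fmt).date()
            break
        except ValueError:
            continue
    if born is None:
        problems.append("missing or unparseable birthdate")
    elif born > today:
        problems.append("birthdate in the future")
        born = None
    cleaned["birthdate"] = born
    return cleaned, problems
```

Returning the list of problems alongside the cleaned record lets errors be tracked as well as corrected.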

Data Warehouse Architectures

• Transform – includes
  • converting data format or representation from the source to the target system
  • partitioning data according to predefined criteria
  • aggregating data from detailed to summary level
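The three kinds of transformation can be sketched on a small hypothetical data set. The source layout (dates as DD/MM/YYYY, amounts in cents), the monthly partitioning criterion and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical source rows: (date as DD/MM/YYYY, store, amount in cents).
SOURCE = [
    ("01/03/2004", "S1", 1250),
    ("01/03/2004", "S1", 750),
    ("02/03/2004", "S2", 400),
    ("15/04/2004", "S1", 1000),
]


def convert(row):
    """Convert the source representation to the target one (ISO date, dollars)."""
    day, month, year = row[0].split("/")
    return (f"{year}-{month}-{day}", row[1], row[2] / 100)


def partition_by_month(rows):
    """Partition rows on a predefined criterion: the calendar month."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[0][:7]].append(row)
    return dict(parts)


def aggregate_daily(rows):
    """Aggregate detailed rows to one total per (date, store)."""
    totals = defaultdict(float)
    for day, store, dollars in rows:
        totals[(day, store)] += dollars
    return dict(totals)


converted = [convert(r) for r in SOURCE]
partitions = partition_by_month(converted)
march_summary = aggregate_daily(partitions["2004-03"])
```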

Data Warehouse Architectures

Load and Index
• Refresh mode – filling the EDW by bulk rewriting of the target data at periodic intervals, replacing the previous contents
• Update mode – only changes in the source data are written to the data warehouse, without overwriting or deleting previous contents
• Create the necessary indexes
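The difference between the two load modes, and the indexing step, can be sketched as below. The row layout, the load-date stamp and the function names are hypothetical; the index is a plain dictionary standing in for a database index.

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the target from scratch from the full source."""
    return list(source_rows)


def update_load(target, changed_rows, load_date):
    """Update mode: append changed rows with a load date; keep the old rows."""
    return target + [row + (load_date,) for row in changed_rows]


def build_index(target, key_position=0):
    """A simple index: key value -> positions of matching rows in the target."""
    index = {}
    for pos, row in enumerate(target):
        index.setdefault(row[key_position], []).append(pos)
    return index


# Hypothetical rows: (customer_id, city).
initial = refresh_load([(1, "Melbourne"), (2, "Sydney")])
warehouse = update_load([r + ("2004-01-01",) for r in initial],
                        [(2, "Brisbane")], "2004-02-01")
index = build_index(warehouse)
```

After the update, customer 2 has two rows: the earlier one is kept, so the warehouse retains history.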


Entity-Relationship Modelling

Entity-Relationship Modelling

[Figure: ER diagram with the entities Sale, Period, Product, Store, Customer, Region, Customer Type and Product Type, linked by the relationships "groups in", "groups within", "within", "contains", "makes" and "located at".]

(based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)

Entity-Relationship Modelling

• Entities, attributes and relationships
• Rules of normalisation
  • 3NF is typical
  • Protection of the integrity of the database by avoiding anomalies
  • Every logical thing is represented only once
• Separate consideration of logical and physical aspects

Entity-Relationship Modelling

[Figure: ER model for the Northwind sample database]

Entity-Relationship Modelling

• Large numbers of tables
  • Oracle Financials – 1,800; SAP – up to 8,000
• Commonly used
  • Feels natural once you get used to it
• Research shows that they are not easily understood by IT people
  • Especially concepts like abstraction, generalisation, sub-types, etc.


Multi-dimensional Modelling

Multi-dimensional Modelling

• It is possible to conceptualise data as multi-dimensional
• Difficult to design
• Resulting reports are easy to use
• Advocated by Ralph Kimball (see his manifesto, and a rebuttal, available on the web site)
• A logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access

Multi-dimensional Modelling

• An approach to database design that provides a database that is easy to understand and navigate
• The aim is to encourage understanding, exploration and learning
• Each number in a database has a set of associated attributes
  • What it measures, what point in time it was created, what location it's from, what product it's associated with, what promotion, etc.
  • This makes the number meaningful

Multi-dimensional Modelling

• Each attribute associated with each number represents a dimension
  • Measure, time, location, product, etc.
• Resulting views are easy to navigate and move around
  • Slice and dice
  • Report template

Multi-dimensional Modelling

Example Widget Sales ($ million)

One dimension (State):

Vic    NSW    QLD    WA     SA     TAS
43.6   53.4   31.4   27.5   28.3   14.7

Two dimensions (location x time), State across and Year (1998 to 2002) down:

Vic    NSW    QLD    WA     SA     TAS
43.6   53.4   31.4   27.5   28.3   14.7
46.2   52.1   29.6   25.1   27.1   18.2
56.3   62.3   35.1   29.4   21.5   13.3
50.1   57.2   33.6   28.1   22.5   16.3
48.2   53.4   31.4   28.4   25.1   15.4

[The figure labels the rows with the years 1998 to 2002; which row belongs to which year is not recoverable.]
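One way to hold such data in code is a dictionary keyed by the dimension values. The sketch below uses two rows of the widget sales figures; pairing those rows with 1998 and 1999 is an assumption made for illustration, and the helper names are hypothetical.

```python
# Widget sales ($ million) keyed by (state, year). The values come from the
# slide; pairing the first two rows with 1998 and 1999 is an assumption.
STATES = ["Vic", "NSW", "QLD", "WA", "SA", "TAS"]
ROWS = {
    1998: [43.6, 53.4, 31.4, 27.5, 28.3, 14.7],
    1999: [46.2, 52.1, 29.6, 25.1, 27.1, 18.2],
}
cube = {(state, year): value
        for year, values in ROWS.items()
        for state, value in zip(STATES, values)}


def one_dimension(cube, year):
    """Fix the year to get a one-dimensional view: state -> sales."""
    return {s: v for (s, y), v in cube.items() if y == year}


def total_by_year(cube):
    """Collapse the state dimension to get total sales per year."""
    totals = {}
    for (_, year), value in cube.items():
        totals[year] = totals.get(year, 0.0) + value
    return totals
```

Adding a third dimension such as product would only lengthen the key to `(state, year, product)`; the access pattern stays the same.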

Multi-dimensional Modelling

Three dimensions (location x time x product):

[Figure: the State x Year grid of widget sales for 1998 to 2002 extended into a cube along a third axis, Product, with the values Widgets, Sprockets, Gaskets and Flanges.]

Multi-dimensional Modelling

• Information spaces are usually described as cubes, hyper-cubes or n-cubes
• Resulting views of databases are easy to navigate and move around
  • Slicing and dicing
  • Report template

Multi-dimensional Modelling

Slicing and Dicing
• Select certain dimension values to examine a set of data

Multi-dimensional Modelling

Report Templates
• One template is produced for a set of slices
• Data changes, layout doesn't

[Figure: report template with a Location drop-down box and a Year drop-down box above a bar chart titled "Product Sales: Victoria, 2001", showing Widgets, Sprockets, Flanges and Gaskets on a scale of 0 to 50.]

Multi-dimensional Modelling

From Traditional Relational to Multi-dimensional

[Figure, from Pilot Software OLAP White Paper: a typical relational database table, and the same data displayed in two dimensions.]

Easy! (The key is to identify the continuous and discrete variables in the flat file.)
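The conversion can be sketched as a pivot: the discrete variables become the row and column headings, and the continuous variable fills the cells. The flat rows and the function names `pivot` and `render` below are hypothetical.

```python
# Hypothetical flat-file rows: (product, region, sales).
# product and region are discrete; sales is continuous.
FLAT = [
    ("Widgets", "East", 50),
    ("Widgets", "West", 60),
    ("Gaskets", "East", 70),
    ("Gaskets", "West", 80),
    ("Widgets", "East", 5),
]


def pivot(rows):
    """Turn flat rows into a grid: product down the side, region across."""
    products = sorted({p for p, _, _ in rows})
    regions = sorted({r for _, r, _ in rows})
    grid = {p: {r: 0 for r in regions} for p in products}
    for product, region, sales in rows:
        grid[product][region] += sales  # the continuous variable fills the cells
    return grid


def render(grid):
    """Format the grid as plain-text lines."""
    regions = list(next(iter(grid.values())))
    lines = ["".ljust(10) + "".join(r.rjust(8) for r in regions)]
    for product, cells in grid.items():
        lines.append(product.ljust(10)
                     + "".join(str(cells[r]).rjust(8) for r in regions))
    return lines


grid = pivot(FLAT)
```

Calling `print("\n".join(render(grid)))` displays the same data as a two-dimensional table.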


Star Schema

Star Schema

• Used to implement dimensional analysis using standard relational database technology
• Very common in data warehousing
• Many variations
• Two components:
  • Fact table – contains measurements of the business, e.g. sales, purchase order, shipment
  • Dimension tables – store the textual descriptions of the dimensions of the business, e.g. product, customer, vendor, store

Star Schema

• Fact tables store the hard data
• Dimension tables store all the information about our dimensions
• The fact table has a many-to-one relationship with each dimension table
• Each dimension table has a primary key that appears as a foreign key in the fact table, whose primary key is a concatenation of all of the foreign keys

Star Schema

Dimension tables in star schemas are denormalised, resulting in:
• Fewer tables
• A structure that is simpler for users to navigate
• Fewer complex multi-table joins

Star schema

[Figure: star schema with the Sale fact table at the centre and four dimension tables around it. The legend distinguishes primary keys from foreign keys.]

Sale (fact table): Time key, Store key, Customer key, Product key, Dollar sales, Unit sales
Customer: Customer key, Name, Customer type
Product: Product key, Product type, Weight
Store: Store key, Address, Region
Time: Time key, Day, Month
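This schema can be sketched with Python's built-in `sqlite3` module. The table and column names follow the figure, but the `_dim` and `_fact` suffixes, the column types and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE store_dim    (store_key INTEGER PRIMARY KEY, address TEXT, region TEXT);
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, customer_type TEXT);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, product_type TEXT, weight REAL);
CREATE TABLE sale_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    dollar_sales REAL,
    unit_sales   INTEGER,
    -- the fact table's primary key concatenates all four foreign keys
    PRIMARY KEY (time_key, store_key, customer_key, product_key)
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "2004-03-01", "2004-03"), (2, "2004-04-01", "2004-04")])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                 [(1, "1 Main St", "Vic"), (2, "2 High St", "NSW")])
conn.execute("INSERT INTO customer_dim VALUES (1, 'A. Buyer', 'Retail')")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widgets', 0.5)")
conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 100.0, 10), (1, 2, 1, 1, 40.0, 4),
                  (2, 1, 1, 1, 60.0, 6)])

# One join per dimension used, then group by the dimension attributes.
sales_by_region_month = conn.execute("""
    SELECT s.region, t.month, SUM(f.dollar_sales)
    FROM sale_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    JOIN time_dim  t ON t.time_key  = f.time_key
    GROUP BY s.region, t.month
    ORDER BY s.region, t.month
""").fetchall()
```

Every query against the star follows the same shape: join the fact table to the dimensions of interest, filter or group on their attributes, and aggregate the facts.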

Snowflake schema

[Figure: snowflake schema. The Sale fact table is linked to the Time, Product, Store and Customer dimensions, with the further tables ProductType, CustomerType and Region normalised out of the dimensions.]

"Do not snowflake your dimensions, even if very large. If you do snowflake your dimensions, prepare to live with poor performance" – Kimball (1996)

Star Schema

• Dimensions can be shared amongst fact tables.

Star Schema

• ER schemas are useful for data mapping to legacy systems and for integration of the data warehouse
• Star schemas are useful for the design of warehouse databases as they are efficient and easy to understand and use
  • They allow relational databases to support multi-dimensional data cubes

Star Schema

Steps in the design process
1. Choose a business process
2. Choose the grain of the fact table
   • Too fine → oversized database
   • Too coarse → loss of meaningful information
3. Choose the dimensions
4. Choose the measured facts (usually numeric, additive quantities)
5. Complete the dimension tables

Kimball (1996)

Extra steps in the design process

6. Determine the strategy for slowly changing dimensions
7. Create aggregations and other physical storage components
8. Determine the historical duration of the database
9. Determine the urgency with which the data is to be extracted and loaded into the data warehouse

Kimball (1996)

An Example: Retail Trading

• A large grocery chain with approx. 500 stores
• Each store has approx. 60,000 products on its shelves
• Need to maximise profit and keep shelves stocked
• Important decisions concern pricing and promotion
• Promotion types are:
  • Temporary price reductions
  • Newspaper advertisements
  • Shelf and end-aisle displays
  • Coupons

An Example: Retail Trading

1. Choose a Business Process
   • Daily Item Movement
2. Choose the grain of the fact table
   • Stock Keeping Unit (SKU) by store by promotion by day
3. Choose the Dimensions
   • Time, product, store and promotion
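The effect of this grain on database size can be estimated with simple arithmetic. The store and product counts come from the example (about 500 stores, about 60,000 products each); the daily movement rate is an assumption invented for illustration and is not stated in the source.

```python
STORES = 500                  # from the example
PRODUCTS_PER_STORE = 60_000   # from the example
DAYS_PER_YEAR = 365

# Assumption for illustration only: the share of SKUs that actually move in a
# given store on a given day. The source does not state this figure.
ASSUMED_DAILY_MOVEMENT_RATE = 0.05


def max_rows_per_year(stores, products, days):
    """Upper bound: every SKU moves in every store every day, one promotion each."""
    return stores * products * days


def estimated_rows_per_year(stores, products, days, movement_rate):
    """Rows per year when only a fraction of SKUs move per store per day."""
    return round(stores * products * days * movement_rate)


upper_bound = max_rows_per_year(STORES, PRODUCTS_PER_STORE, DAYS_PER_YEAR)
estimate = estimated_rows_per_year(STORES, PRODUCTS_PER_STORE, DAYS_PER_YEAR,
                                   ASSUMED_DAILY_MOVEMENT_RATE)
```

Even under the assumed 5% movement rate the fact table grows by hundreds of millions of rows a year, which is why the choice of grain has to be weighed against database size.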

An Example: Retail Trading

Retail Trading Dimensions

Sale (fact table): Time key, Product key, Store key, Promotion key, facts (to be detailed next)
Promotion: Promotion key, other promotion attributes
Product: Product key, other product attributes
Store: Store key, other store attributes
Time: Time key, other time attributes

An Example: Retail Trading

4. Choose the measured facts

Sale (fact table): Time key, Product key, Store key, Promotion key, Dollar Sales, Unit Sales, Dollar Costs, Customer Count
Promotion: Promotion key, other promotion attributes
Product: Product key, other product attributes
Store: Store key, other store attributes
Time: Time key, other time attributes

An Example: Retail Trading

5. Complete the dimension tables

Sale (fact table): Time key, Product key, Store key, Promotion key, Dollar Sales, Unit Sales, Dollar Costs, Customer Count
Promotion: Promotion key, other promotion attributes
Product: Product key, SKU Description, SKU Number, Package Size, Brand, Sub Category, Department, Package Type, Diet Type, Weight, Weight unit of measure, Units per retail case, Units per ship case, Cases per pallet
Store: Store key, other store attributes
Time: Time key, other time attributes
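A pricing and promotion question can then be answered with one join and one aggregation. The sketch below uses Python's built-in `sqlite3` module; the `promotion_type` attribute, the "No promotion" row and all sample values are hypothetical, and only the Promotion dimension is created to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion (promotion_key INTEGER PRIMARY KEY, promotion_type TEXT);
CREATE TABLE sale (
    time_key INTEGER, product_key INTEGER, store_key INTEGER,
    promotion_key INTEGER REFERENCES promotion(promotion_key),
    dollar_sales REAL, unit_sales INTEGER, dollar_costs REAL,
    customer_count INTEGER,
    PRIMARY KEY (time_key, product_key, store_key, promotion_key)
);
""")

# Hypothetical rows; promotion_type is an assumed attribute name.
conn.executemany("INSERT INTO promotion VALUES (?, ?)",
                 [(0, "No promotion"), (1, "Temporary price reduction"),
                  (2, "Coupon")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, 10, 1, 0, 200.0, 100, 150.0, 90),
    (1, 11, 1, 1, 300.0, 200, 260.0, 150),
    (2, 10, 2, 1, 100.0, 70, 90.0, 60),
    (2, 11, 2, 2, 80.0, 40, 50.0, 35),
])

# Dollar sales, unit sales and dollar costs are additive, so SUM is safe.
profit_by_promotion = conn.execute("""
    SELECT p.promotion_type,
           SUM(s.unit_sales),
           SUM(s.dollar_sales) - SUM(s.dollar_costs)
    FROM sale s JOIN promotion p ON p.promotion_key = s.promotion_key
    GROUP BY p.promotion_type
    ORDER BY p.promotion_type
""").fetchall()
```

Each result row gives a promotion type, the units moved under it and the gross margin (dollar sales minus dollar costs).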

References

Inmon, W. H. (1996) Building the Data Warehouse (2nd ed), Wiley, NY.
Kimball, R. (1996) The Data Warehouse Toolkit, Wiley, NY.
McFadden, F., Hoffer, J. and Prescott, M. (1999) Modern Database Management, Addison-Wesley.

Questions?

[email protected]
School of Information Management and Systems, Monash University
T1.28, T Block, Caulfield Campus
9903 2735