The Data Warehouse

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of “all” an organisation’s data in support of management’s decision making process.”

• Data warehouses developed because, for example:

– If you want to ask “How much does this customer owe?” then the sales database is probably the one to use.

– However, if you want to ask “Was this ad campaign more successful than that one?”, you require data from more disparate sources, i.e. other sources such as production, marketing etc.

Organizational Data Flow and Data Storage

Components

Characteristics of a Data Warehouse

• Subject oriented – organized based on use; e.g. business process

• Integrated – inconsistencies removed

• Nonvolatile – stored in read-only format

• Time variant – data are normally time series

• Summarized – in decision-usable format

• Large volume – data sets are quite large

• Non normalized – often redundant

Non-volatile and non normalised Data

• Data in the warehouse is not updated in real-time but is refreshed from operational systems on a regular basis.

• New data is always added as a supplement to the database, rather than a replacement.

• Data is non-normalised; this is achieved using the starflake and similar schemas.
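As a rough illustration of the points above, the sketch below models a periodic refresh that appends new rows instead of overwriting old ones. The function name `refresh`, the `load_date` field and the sample customer rows are assumptions made for this example only.

```python
from datetime import date

# Hypothetical warehouse store: an append-only list of rows.
warehouse = []

def refresh(warehouse, operational_rows, load_date):
    """Append a dated copy of each operational row; never update in place."""
    for row in operational_rows:
        warehouse.append({**row, "load_date": load_date})

# Two periodic refreshes from an (assumed) operational system.
refresh(warehouse, [{"customer": "C1", "balance": 100}], date(2005, 1, 1))
refresh(warehouse, [{"customer": "C1", "balance": 80}], date(2005, 2, 1))

# Both values survive, so the warehouse holds a time series for the customer.
history = [row["balance"] for row in warehouse if row["customer"] == "C1"]
```

Because earlier rows are never replaced, the same structure gives both the nonvolatile and the time-variant behaviour described in the characteristics list.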

© Pearson Education Limited 1995, 2005

A data warehouse process model

[Figure components: Operational Database(s); External Data; ETL Routine (Extract/Transform/Load); Data Warehouse; Dependent Data Mart; Independent Data Mart; Extract/Summarize Data; Decision Support System; Report]

Data Warehousing Architecture

• Fusion and cleansing: sourcing, acquisition, cleanup and transformation of data

– Implementing data warehouses involves extracting data from operational systems, including legacy systems, and putting it into a suitable format.

– The tools for this perform all the conversions, summarisations, key changes, structural changes, and condensations needed to transform disparate data into information that can be used by decision support tools.
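A minimal sketch of the extract, transform and load steps described above, assuming two hypothetical source systems that record the same kind of sale in inconsistent formats. All field names, sample rows and function names are invented for illustration.

```python
from collections import defaultdict

# Extract: rows as they might arrive from two (hypothetical) operational systems.
sales_system = [
    {"store": "1023", "product": "k596", "amount": "111.21"},
    {"store": "1023", "product": "K596", "amount": "20.00"},
]
legacy_system = [
    {"STORE_NO": 1024, "PROD": "K596 ", "AMT_CENTS": 5000},
]

def transform_sales(row):
    """Convert types and normalise the product code."""
    return {
        "store": int(row["store"]),
        "product": row["product"].strip().upper(),
        "amount": float(row["amount"]),
    }

def transform_legacy(row):
    """Structural change: rename fields and convert cents to currency units."""
    return {
        "store": row["STORE_NO"],
        "product": row["PROD"].strip().upper(),
        "amount": row["AMT_CENTS"] / 100,
    }

def load(rows):
    """Summarise cleaned rows into totals per (store, product)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["store"], row["product"])] += row["amount"]
    return dict(totals)

cleaned = [transform_sales(r) for r in sales_system]
cleaned += [transform_legacy(r) for r in legacy_system]
warehouse = load(cleaned)
```

After the transform step both sources share one format, so the load step can summarise them together without caring where each row came from.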

Data in a Data Warehouse are Integrated

Meta Data

• A key concept behind the data warehouse (DW) is meta data.

– Meta data is data about the data (which has come from the data sources) and shows what data is contained in the DW, where it came from, and what changes have been made to it.

• The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data.

– For example, a line in a sales database may contain: 1023 K596 111.21

– This is mostly meaningless until we consult the metadata (in the data directory) that tells us it was store number 1023, product K596 and sales of $111.21.
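The sales-record example above can be sketched as follows. The metadata layout (field name, converter, description) and the function `decode` are assumptions made for illustration, not a prescribed data directory format.

```python
# Hypothetical metadata entries: (field name, type converter, description).
metadata = [
    ("store_number", int, "store number"),
    ("product_code", str, "product code"),
    ("sales_usd", float, "sales in dollars"),
]

def decode(line, metadata):
    """Use the metadata to turn a raw line into a labelled, typed record."""
    values = line.split()
    if len(values) != len(metadata):
        raise ValueError("line does not match the metadata description")
    return {
        name: convert(value)
        for (name, convert, _description), value in zip(metadata, values)
    }

record = decode("1023 K596 111.21", metadata)
```

Without the `metadata` list the three raw values carry no meaning; with it, each value gets a name and a type.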

Data marts

• A data mart is a data store that is subsidiary to a data warehouse of integrated data.

• The data mart is directed at a partition of data (subject area) that is created for the use of a dedicated group of users and is sometimes termed a “subject warehouse”

• The data mart might be a set of denormalised, summarised or aggregated data that can be placed on the data warehouse database or more often placed on a separate physical store.

• Data marts can be “dependent data marts” when the data is sourced from the data warehouse.

• Independent data marts represent fragmented solutions to a range of business problems in the enterprise. However, such a concept should not be deployed, as it lacks the “data integration” concept that is associated with data warehouses.

Data Warehousing Typology

– The data warehouse can be at a single location, i.e. a central data warehouse

– or –

– The collection of data is replicated around multiple locations, i.e. a distributed data warehouse. This means users have a local copy of the data warehouse, which can improve query run-times and reduce communications overheads. (Note: the principles associated with distributed databases apply equally to distributed data warehouses.)

Data Warehousing Design

DT211/4


Designing Data Warehouses

• Need to find answers for questions such as:

– Which user requirements are most important?

– Which data should be considered first?

• The database component of a data warehouse is described using a technique called dimensionality modelling.


Dimensionality modeling

• A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access

• Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables.
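A small sketch of this structure using Python's built-in `sqlite3` module. The table and column names (`branch`, `client`, `property_sale` and so on) and the sample rows are hypothetical and are not taken from the DreamHome figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small reference tables with simple keys.
CREATE TABLE branch (branch_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE client (client_id INTEGER PRIMARY KEY, name TEXT);

-- Fact table: its primary key is composed of the dimension keys.
CREATE TABLE property_sale (
    branch_id     INTEGER REFERENCES branch(branch_id),
    client_id     INTEGER REFERENCES client(client_id),
    sale_date     TEXT,
    selling_price REAL,
    PRIMARY KEY (branch_id, client_id, sale_date)
);
""")
conn.execute("INSERT INTO branch VALUES (1, 'Glasgow')")
conn.execute("INSERT INTO client VALUES (10, 'A. Buyer')")
conn.execute("INSERT INTO property_sale VALUES (1, 10, '2005-01-15', 250000.0)")

# A second row with the same composite key is rejected by the database.
try:
    conn.execute("INSERT INTO property_sale VALUES (1, 10, '2005-01-15', 999.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM property_sale").fetchone()[0]
```

The composite key means one fact row is identified by its combination of dimension values rather than by a separate surrogate identifier.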


Fact and dimension tables for each business process of DreamHome


ER model of property sales business process of DreamHome


Star schema for property sales of DreamHome


Dimensionality modeling

• Star schema is a logical structure that has a fact table containing factual data in the centre, surrounded by dimension tables containing reference data, which can be denormalised.

• For example: dimension tables (property for sale, client, branch and staff) all have city, region and county repeated.
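To make the denormalised dimension concrete, the sketch below stores city, region and county directly in a hypothetical `branch` dimension, so a regional total needs only one join from the fact table. Table names, column names and data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalised dimension: city, region and county are stored in the
-- dimension row itself instead of in separate lookup tables.
CREATE TABLE branch (
    branch_id INTEGER PRIMARY KEY,
    city TEXT, region TEXT, county TEXT
);
CREATE TABLE property_sale (
    branch_id     INTEGER REFERENCES branch(branch_id),
    sale_id       INTEGER,
    selling_price REAL,
    PRIMARY KEY (branch_id, sale_id)
);
""")
conn.executemany(
    "INSERT INTO branch VALUES (?, ?, ?, ?)",
    [
        (1, "Dublin", "Leinster", "Dublin"),
        (2, "Kilkenny", "Leinster", "Kilkenny"),
        (3, "Cork", "Munster", "Cork"),
    ],
)
conn.executemany(
    "INSERT INTO property_sale VALUES (?, ?, ?)",
    [(1, 1, 300000.0), (2, 2, 200000.0), (3, 3, 250000.0), (1, 4, 100000.0)],
)

# One join from the fact table to the dimension is enough to group by region.
sales_by_region = conn.execute("""
    SELECT b.region, SUM(s.selling_price)
    FROM property_sale AS s
    JOIN branch AS b ON b.branch_id = s.branch_id
    GROUP BY b.region
    ORDER BY b.region
""").fetchall()
```

The region name "Leinster" is stored twice, which is the redundancy the slide refers to; the trade-off is that the query avoids extra joins to separate city and region tables.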


Dimensionality modeling

• Star schemas can be used to speed up query performance by denormalizing reference information into a single dimension table.

• For example: dimension tables (property for sale, client, branch and staff) all have city, region and county repeated.


Database Design Methodology for Data Warehouses

• ‘Methodology’ includes the following steps:

– Choosing the process

– Choosing the facts and dimensions

– Choosing the facts

– Storing pre-calculations in the fact table

– …


Choosing the process

• The process (function) refers to the subject matter of a particular data warehouse, chosen to answer the most commercially important business questions.

• Identify the discrete business processes, for example: property sales.


ER model of property sales business process of DreamHome


Choosing the facts

• Decide what a record of the fact table is to represent, e.g. property sales.

• Facts should be numeric and additive.

• Identify the dimensions of the fact table. The contents of the fact table also determine the contents of each dimension table.

• Dimensions set the context for asking questions about the facts in the fact table, e.g. ClientBuyer: client no., client name, city, region, county.
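One way to see why facts should be numeric and additive is that an additive measure can be summed along any dimension and still give the same grand total. The sketch below uses invented sales rows and a hypothetical helper `roll_up`.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension values plus one numeric, additive measure.
sales = [
    {"client": "C1", "region": "East", "selling_price": 200_000},
    {"client": "C2", "region": "East", "selling_price": 150_000},
    {"client": "C1", "region": "West", "selling_price": 300_000},
]

def roll_up(facts, dimension, measure):
    """Sum the measure for each distinct value of the chosen dimension."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[measure]
    return dict(totals)

by_region = roll_up(sales, "region", "selling_price")
by_client = roll_up(sales, "client", "selling_price")
grand_total = sum(fact["selling_price"] for fact in sales)
```

Here the dimension chosen for the roll-up only changes how the total is split, which is the sense in which dimensions set the context for questions about the facts.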


Star schema for property sales of DreamHome