Data Warehousing, presented by Murli

Data warehouse introduction


DESCRIPTION

These slides give an overview of data warehousing.


Page 1: Data warehouse introduction

Data Warehousing

presented by Murli

Page 2: Data warehouse introduction

Data Warehousing

•Aims of information technology:
• To help workers in their everyday business activity and improve their productivity – clerical data processing tasks
• To help knowledge workers (executives, managers, analysts) make faster and better decisions – decision support systems

•Two types of applications:
• Operational applications
• Analytical applications

Page 3: Data warehouse introduction

Data Warehousing (Contd.)

•In most organizations, data about specific parts of the business is there: lots and lots of data, somewhere, in some form.
•Data is available but not information, and not the right information at the right time.
•There is a need to
• bring together information
• off-load decision support applications from the on-line transaction system

Page 4: Data warehouse introduction

Data Warehouse

•“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” --- W. H. Inmon

•Collection of data that is used primarily in organizational decision making

•A decision support database that is maintained separately from the organization’s operational database

Page 5: Data warehouse introduction

Data Warehouse - Subject Oriented

•Data that gives information about a particular subject.

•Data for modeling and analysis.

•Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Page 6: Data warehouse introduction

Data Warehouse – Integrated

•It is constructed by integrating multiple, heterogeneous data sources.

•Data cleaning and data integration techniques are applied.

•When data is moved to the warehouse, it is converted.

Page 7: Data warehouse introduction

Data Warehouse - Time Variant

•Data is stable in a data warehouse.
•It stores historical as well as current data.
•Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
•The key of operational data, by contrast, may or may not contain a time element.

Page 8: Data warehouse introduction

Data Warehouse - Non-Volatile

•A physically separate store of data transformed from the operational environment.
•No update or delete on historical data.
•Operational update of data does not occur in the data warehouse; new data is appended.
•Only two data operations are needed: initial loading of data and access of data.

Page 9: Data warehouse introduction

Data modifications & schema design

•A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques.

•Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
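
The periodic, append-only loading described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the table `sales_history`, the function `nightly_load` and the sample rows are hypothetical names and data, and an in-memory SQLite database stands in for a real warehouse server.

```python
import sqlite3


def nightly_load(conn, batch, load_date):
    """Append one batch of (item, amount) rows, stamped with the load date.

    Existing rows are never updated or deleted; each run only adds rows.
    """
    rows = [(load_date, item, amount) for item, amount in batch]
    with conn:  # one transaction per batch: all rows are loaded or none
        conn.executemany(
            "INSERT INTO sales_history (load_date, item, amount) VALUES (?, ?, ?)",
            rows,
        )


# Hypothetical warehouse table; load_date is the explicit time element.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_history (load_date TEXT, item TEXT, amount REAL)")

# Two simulated nightly runs with made-up data.
nightly_load(conn, [("pen", 2.0), ("ink", 5.5)], "2024-01-01")
nightly_load(conn, [("pen", 3.0)], "2024-01-02")

row_count = conn.execute("SELECT COUNT(*) FROM sales_history").fetchone()[0]
pen_total = conn.execute(
    "SELECT SUM(amount) FROM sales_history WHERE item = 'pen'"
).fetchone()[0]
```

Because every row carries its load date, the history of earlier loads stays queryable after later batches arrive.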

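Page 10: Data warehouse introduction

Why Separate Data Warehouse?

•Separate, historical data is needed for decision support.
•Complex decisions: complex decision-support queries would degrade the performance of operational transactions.
•Missing data: decision support needs historical data that operational databases do not typically keep.
•Data consolidation: decision support needs data from heterogeneous sources to be brought together and summarized.
•Data quality: different sources often use inconsistent representations, codes and formats that have to be reconciled.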

Page 11: Data warehouse introduction

Advantages of Data Warehousing

•High query performance
•Queries not visible outside warehouse
•Local processing at sources unaffected
•Can operate when sources unavailable
•Can query data not stored in a DBMS
•Extra information at warehouse
• Modify, summarize (store aggregates)
• Add historical information

Page 12: Data warehouse introduction

Decision Support System

• Information technology to help knowledge workers (executives, managers, analysts) make faster and better decisions
• OLAP is an element of a decision support system
• Data mining is a powerful, high-performance data analysis tool for decision support.

Page 13: Data warehouse introduction

Three-Tier Decision Support Systems

•Warehouse database server
• Almost always a relational DBMS, rarely flat files
•OLAP servers
• Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators
• Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations
•Clients
• Query and reporting tools
• Analysis tools
• Data mining tools

Page 14: Data warehouse introduction

The Complete Decision Support System

[Figure: information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed into the data warehouse and data marts on the data warehouse server (Tier 1); these serve the OLAP servers, e.g. MOLAP and ROLAP (Tier 2); these in turn serve the clients for OLAP, query/reporting and data mining (Tier 3).]

Page 15: Data warehouse introduction

Data Sources

•Data sources are often the operational systems, providing the lowest level of data.

•Data sources are designed for operational use, not for decision support, and the data reflect this fact.

•The data sources often come from different systems that run on a wide range of hardware, and much of the software is built in-house or highly customized.

•Multiple data sources introduce a large number of issues, such as semantic conflicts.

Page 16: Data warehouse introduction

Creating and Maintaining a Warehouse

•A data warehouse needs several tools that automate or support tasks such as:

• Data extraction from different external data sources, operational databases, files of standard applications

• Data cleaning (finding and resolving inconsistency in the source data)

• inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violation of integrity constraints.

• optional fields in data entry are significant sources of inconsistent data.

Page 17: Data warehouse introduction

Creating and Maintaining a Warehouse

• Integration and transformation of data (between different data formats, languages, etc.)
• Data loading (loading the data into the data warehouse)
  • checking integrity constraints, sorting, summarizing, etc.
• Data replication (replicating source database into the data warehouse)
  • used to incrementally refresh a warehouse when sources change
• Data refreshment
  • propagating updates on source data to the data stored in the warehouse
  • periodically or immediately
• Data archiving
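
As a rough illustration of the data cleaning task listed above (inconsistent value assignments, inconsistent field lengths, missing entries), a cleaning step might look like the sketch below. The field names, the code table `GENDER_CODES`, the length limit and the sample rows are all assumptions made for the example, not part of any particular tool.

```python
# Hypothetical mapping of the codes used by different sources to one convention.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
MAX_NAME_LENGTH = 30  # assumed field length in the warehouse


def clean_record(record):
    """Return (cleaned_record, problems) for one source row."""
    problems = []
    cleaned = dict(record)

    # Inconsistent value assignments: map every known code to "M" or "F".
    value = record.get("gender")
    raw = "" if value is None else str(value).strip().lower()
    if raw in GENDER_CODES:
        cleaned["gender"] = GENDER_CODES[raw]
    else:
        cleaned["gender"] = None
        problems.append("gender: missing or unknown code")

    # Missing entries and inconsistent field lengths: trim, check, truncate.
    name = (record.get("name") or "").strip()
    if not name:
        problems.append("name: missing entry")
    cleaned["name"] = name[:MAX_NAME_LENGTH]

    return cleaned, problems


# Made-up source rows showing three different conventions.
source_rows = [
    {"name": " Alice ", "gender": "female"},
    {"name": "Bob", "gender": 1},
    {"name": None, "gender": "x"},
]
results = [clean_record(row) for row in source_rows]
```

Rows that come back with a non-empty problem list could be routed to a reject file for manual review instead of being loaded.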

Page 18: Data warehouse introduction

The Data Warehousing Models

•Enterprise Warehouse
• collects all the information about subjects spanning the entire organization
•Data Mart
• a subset of corporate-wide data that is of value to a specific group of users
• its scope is confined to specific, selected groups, such as a marketing data mart
• independent vs. dependent (sourced directly from the warehouse) data marts
•Virtual warehouse
• a set of views over operational databases
• only some summary views are materialized
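
The virtual warehouse idea above can be sketched with an ordinary SQL view over an operational table. The table `orders`, the view `sales_by_region` and the data are hypothetical. SQLite has no materialized views, so a snapshot table (`sales_by_region_mat`) is used here as a stand-in for a materialized summary view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount)
        VALUES ('north', 10.0), ('north', 5.0), ('south', 7.5);

    -- Virtual: the view is recomputed from the operational table on every query.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Stand-in for a materialized summary view: a stored snapshot of the view.
    CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
    """
)

# A new operational row arrives after the snapshot was taken.
conn.execute("INSERT INTO orders (region, amount) VALUES ('south', 2.5)")

view_rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
mat_rows = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region"
).fetchall()
```

The view reflects the new row immediately, while the snapshot stays as it was until it is refreshed. That is the usual trade-off between freshness and query cost.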

Page 19: Data warehouse introduction

Physical Structure of Data Warehouse

•The basic architectures for constructing a data warehouse are:
• Centralized
• Distributed
• Federated
• Tiered

•The data warehouse is distributed for load balancing, scalability and higher availability

Page 20: Data warehouse introduction

Physical Structure of Data Warehouse (Contd.)

•The logical data warehouse is only virtual
•The central data warehouse is physical
•There exist local data marts on different tiers which store copies or summarizations of the previous tier.

Page 21: Data warehouse introduction

Data Processing Models

•There are two basic data processing models:
• OLTP (On-Line Transaction Processing)
  • describes processing at operational sites
  • aim is reliable and efficient processing of a large number of transactions and ensuring data consistency
• OLAP (On-Line Analytical Processing)
  • describes processing at the warehouse
  • aim is efficient multidimensional processing of large data volumes

Page 22: Data warehouse introduction

OLTP vs. OLAP

                      OLTP                           OLAP
users                 clerk, IT professional         knowledge worker
function              day-to-day operations          decision support
DB design             application-oriented           subject-oriented
data                  current, up-to-date,           historical, summarized,
                      detailed, flat relational,     multidimensional,
                      isolated                       integrated, consolidated
usage                 repetitive                     ad-hoc
access                read/write,                    lots of scans
                      index/hash on prim. key
unit of work          short, simple transaction      complex query
# records accessed    tens                           millions
# users               thousands                      hundreds
DB size               100MB-GB                       100GB-TB
metric                transaction throughput         query throughput, response

Page 23: Data warehouse introduction

OLAP

•Main goal: support ad-hoc but complex querying performed by business analysts
•Interactive process of creating, managing, analyzing and reporting on data
•Extends spreadsheet-like analysis to work with huge amounts of data in a data warehouse
•Data exploration and aggregation in various ways
•Typical applications include assessing the effectiveness of a marketing campaign, forecasting product sales and spotting trends

Page 24: Data warehouse introduction

OLAP (Contd.)

•Allows a sophisticated user to analyse data using complex, multi-dimensional views
•Places key performance indicators (measures) into context (dimensions)
• Measures are pre-aggregated
• Data retrieval is significantly faster
•The processed cube is made available to business analysts, who can browse the data using a variety of tools for ad hoc, interactive analytical processing

Page 25: Data warehouse introduction

OLAP Server Architectures

•Relational OLAP (ROLAP):
• Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
• Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
• Greater scalability
• Schema design: Star, Snowflake, Fact Constellation
•Multidimensional OLAP (MOLAP):
• Array-based multidimensional storage engine (sparse matrix techniques)
• Fast indexing to pre-computed summarized data
• Schema design: Cube
•Hybrid OLAP (HOLAP):
• User flexibility - low level: relational, high level: array
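
To make the MOLAP idea of array-based storage with pre-computed summaries more concrete, here is a toy sketch in plain Python. It uses a small dense two-dimensional array. Real MOLAP engines rely on sparse-matrix techniques and far more dimensions, and the product names, region names and figures below are invented for the example.

```python
# Dimension members; a member's position is its array index.
products = ["pen", "lamp"]
regions = ["east", "west"]

# Made-up detail facts: (product, region, units sold).
facts = [("pen", "east", 10), ("pen", "west", 4), ("lamp", "east", 7), ("pen", "east", 1)]

# Dense 2-D array addressed as cube[product_index][region_index].
cube = [[0] * len(regions) for _ in products]
for product, region, units in facts:
    cube[products.index(product)][regions.index(region)] += units

# Summaries computed once along each dimension, so that later
# queries read a stored total instead of rescanning the facts.
units_by_product = [sum(row) for row in cube]
units_by_region = [sum(column) for column in zip(*cube)]
grand_total = sum(units_by_product)
```

A lookup such as `units_by_region[regions.index("east")]` then costs a single array access.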

Page 26: Data warehouse introduction

ROLAP

•Special schema design: snowflake
•Special indexes: bitmap, multi-table join
•Proven technology (relational models, DBMS)
• Tends to outperform specialized MDDB, especially on large data sets
•Products
• IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

Page 27: Data warehouse introduction

Measures and Dimensions

•Measures: key performance indicators that you want to evaluate
• Typically numerical, including volume, sales and cost
• A rule of thumb: if a number makes business sense when aggregated, then it is a measure
• Examples
  • Aggregating daily volume to month, quarter and year makes sense, so volume is a measure
  • Aggregating telephone numbers would not make sense, so they are not measures
• Affects what should be stored in the data warehouse

Page 28: Data warehouse introduction

Measures and Dimensions (Contd.)

•Dimensions: categories of data analysis
• Typical dimensions include product, time, region
• A rule of thumb: when a report is requested “by” something, that something is usually a dimension
• Example
  • Sales report: view sales by month, by region
  • Two dimensions needed are time and region
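
The "by" rule of thumb maps directly onto SQL's GROUP BY clause: the measure is the column being summed and the dimensions are the grouping columns. A minimal sketch with an in-memory SQLite database follows. The `sales` table and its rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01", "east", 100.0),
        ("2024-01", "east", 50.0),
        ("2024-01", "west", 30.0),
        ("2024-02", "east", 20.0),
    ],
)

# "Sales by month, by region": amount is the measure,
# month and region are the two dimensions.
report = conn.execute(
    """
    SELECT month, region, SUM(amount)
    FROM sales
    GROUP BY month, region
    ORDER BY month, region
    """
).fetchall()
```

Summing `amount` is meaningful, which makes it a measure. Summing `region` would not be, which is why it serves as a dimension instead.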

Page 29: Data warehouse introduction

Conceptual Modeling

•Star schema
•Snowflake schema
•Fact constellations, or Galaxy schema

Page 30: Data warehouse introduction

Star

Page 31: Data warehouse introduction

Star Schema

•Fact table
•Dimension tables
•Measures

•A single fact table and, for each dimension, one dimension table
•Does not capture hierarchies directly

Page 32: Data warehouse introduction

Example - Star Schema

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country
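
A cut-down version of this star schema can be tried out in SQLite as sketched below. Only two of the four dimensions are included to keep the example short. The table names `time_dim`, `item_dim` and `sales_fact` and all rows are assumptions made for the illustration, while the column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two dimension tables (branch and location are left out for brevity).
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
        quarter TEXT, year INTEGER
    );
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
    );

    -- The single fact table: one foreign key per dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );

    -- Made-up sample data.
    INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024), (2, 10, 4, 'Q2', 2024);
    INSERT INTO item_dim VALUES (1, 'pen', 'Acme', 'stationery'), (2, 'lamp', 'Lumo', 'home');
    INSERT INTO sales_fact VALUES (1, 1, 10, 20.0), (1, 2, 1, 35.0), (2, 1, 4, 8.0);
    """
)

# Typical star join: the fact table joined once to each dimension it needs.
by_quarter_and_type = conn.execute(
    """
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
    """
).fetchall()
```

Each dimension is exactly one join away from the fact table, which is what gives the schema its star shape.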

Page 33: Data warehouse introduction

Dimension Hierarchies

[Figure: hierarchy for the store dimension, with levels store, sType, city and region.]

•Snowflake schema
•Constellations

Page 34: Data warehouse introduction

Example of Snowflake Schema

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, province_or_street, country

•Represents dimensional hierarchy directly by normalizing tables.
•Easy to maintain and saves storage
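
The effect of normalizing a dimension can be seen in a small SQLite sketch. Here `location` points to a separate `city` table, as in the snowflake example, so a query by country needs one extra join. The rows and the reduced column set are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized hierarchy: location -> city (which carries the country).
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY,
        street TEXT,
        city_key INTEGER REFERENCES city(city_key)
    );
    CREATE TABLE sales_fact (
        location_key INTEGER REFERENCES location(location_key),
        dollars_sold REAL
    );

    -- Made-up sample data: city attributes are stored once per city.
    INSERT INTO city VALUES (1, 'Pune', 'India'), (2, 'Lyon', 'France');
    INSERT INTO location VALUES (10, 'MG Road', 1), (11, 'FC Road', 1), (12, 'Rue A', 2);
    INSERT INTO sales_fact VALUES (10, 5.0), (11, 7.0), (12, 3.0);
    """
)

# Rolling sales up to country level walks the hierarchy through two joins.
sales_by_country = conn.execute(
    """
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location AS l ON l.location_key = f.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
```

The country name is stored once per city rather than once per street, which is the storage saving mentioned above. The price is the additional join.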

Page 35: Data warehouse introduction

Example of Fact Constellation

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  keys:     time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type

•Multiple fact tables that share many dimension tables

Page 36: Data warehouse introduction

Aggregates

• Add up amounts for day 1

In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Result on the slide's sample data: 81

Page 37: Data warehouse introduction

Aggregates (Contd.)

• Add up amounts by day

In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

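Page 38: Data warehouse introduction

Aggregates (Contd.)

• Add up amounts by day, product

In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId

• Going from the totals by day to the totals by day and product is a drill-down; going the other way is a rollup.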
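
The three aggregate queries from the slides can be run as written against a small table, for example with SQLite. The rows below are invented for the illustration and are not the slide's sample data. The `SALE` table is reduced to the three columns the queries use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (prodId TEXT, date INTEGER, amt REAL)")
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?, ?)",
    [("p1", 1, 10.0), ("p2", 1, 5.0), ("p1", 2, 7.0), ("p1", 1, 3.0)],
)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# Add up amounts by day (the rolled-up level).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (the drilled-down level).
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rolling the finer result back up reproduces the coarser one.
rolled_up = {}
for date, _prod_id, total in by_day_product:
    rolled_up[date] = rolled_up.get(date, 0.0) + total
```

Because sums can be built from partial sums, a rollup can be answered from an already stored finer-grained aggregate without rescanning the detail rows.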

Page 39: Data warehouse introduction

Points to be noticed about ROLAP

•Defines complex, multi-dimensional data with a simple model
•Reduces the number of joins a query has to process
•Allows the data warehouse to evolve with relatively low maintenance
•Can contain both detailed and summarized data
•Is based on familiar, proven, and already selected technologies
•BUT: SQL has to be used for multi-dimensional manipulation and calculations, a task it is not well suited to

Page 40: Data warehouse introduction

Thank You