murli-jha
This slide deck gives an overview of data warehousing.
Data Warehousing
Presented by Murli
Data Warehousing
•Aims of information technology:
• To help workers in their everyday business activity and improve their productivity – clerical data processing tasks
• To help knowledge workers (executives, managers, analysts) make faster and better decisions – decision support systems
•Two types of applications:
• Operational applications
• Analytical applications
Data Warehousing (Contd..)
•In most organizations, data about specific parts of the business is there: lots and lots of data, somewhere, in some form.
•Data is available but not information, and not the right information at the right time.
•There is a need to
• bring together information
• off-load decision support applications from the on-line transaction system
Data Warehouse
•“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” --- W. H. Inmon
•Collection of data that is used primarily in organizational decision making
•A decision support database that is maintained separately from the organization’s operational database
Data Warehouse - Subject Oriented
•Data that gives information about a particular subject.
•Data is organized for modeling and analysis.
•Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse – Integrated
•It is constructed by integrating multiple, heterogeneous data sources.
•Data cleaning and data integration techniques are applied.
•When data is moved to the warehouse, it is converted.
Data Warehouse - Time Variant
•Data is stable in a data warehouse.
•It holds historical as well as current data.
•Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
•The key of operational data, by contrast, may or may not contain a time element.
Data Warehouse - Non-Volatile
•A physically separate store of data transformed from the operational environment.
•No update or delete on historical data.
•Operational updates of data do not occur in the data warehouse; new data is appended.
•Only two operations are needed: initial loading of data and access of data.
Data modifications & schema design
•A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques.
•Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
Why Separate Data Warehouse?
•Separate and historical data are needed for decision support.
•Complex decisions.
•Missing data.
•Data consolidation.
•Data quality.
Advantages of Data Warehousing
•High query performance
•Queries not visible outside warehouse
•Local processing at sources unaffected
•Can operate when sources unavailable
•Can query data not stored in a DBMS
•Extra information at warehouse
• Modify, summarize (store aggregates)
• Add historical information
Decision Support System
• Information technology to help knowledge workers (executives, managers, analysts) make faster and better decisions
• OLAP is an element of a decision support system
• Data mining is a powerful, high-performance data analysis tool for decision support.
Three-Tier Decision Support Systems
•Warehouse database server
• Almost always a relational DBMS, rarely flat files
•OLAP servers
• Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators
• Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations
•Clients
• Query and reporting tools
• Analysis tools
• Data mining tools
The Complete Decision Support System
[Figure: Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed into the data warehouse and data marts on the data warehouse server (Tier 1). These serve the OLAP servers (Tier 2, e.g., MOLAP, ROLAP), which in turn serve the clients (Tier 3): OLAP, query/reporting, and data mining tools.]
Data Sources
•Data sources are often the operational systems, providing the lowest level of data.
•Data sources are designed for operational use, not for decision support, and the data reflect this fact.
•Multiple data sources are often from different systems, run on a wide range of hardware and much of the software is built in-house or highly customized.
•Multiple data sources introduce a large number of issues, such as semantic conflicts.
Creating and Maintaining a Warehouse
•A data warehouse needs several tools that automate or support tasks such as:
• Data extraction from different external data sources, operational databases, files of standard applications
• Data cleaning (finding and resolving inconsistency in the source data)
• inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violation of integrity constraints.
• optional fields in data entry are significant sources of inconsistent data.
• Integration and transformation of data (between different data formats, languages, etc.)
• Data loading (loading the data into the data warehouse)
• checking integrity constraints, sorting, summarizing, etc.
• Data replication (replicating source database into the data warehouse)
• used to incrementally refresh a warehouse when sources change
• Data refreshment
• propagating updates on source data to the data stored in the warehouse
• periodically or immediately
• Data archiving
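The cleaning and loading tasks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real ETL tool: the source rows, the field names (`cust_id`, `gender`, `amount`), the `GENDER_MAP` lookup and the `sales` table are all hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical rows extracted from operational sources that encode the
# same attribute differently (inconsistent value assignments).
SOURCE_ROWS = [
    {"cust_id": 1, "gender": "M", "amount": "10.50"},
    {"cust_id": 2, "gender": "female", "amount": "7"},
    {"cust_id": 3, "gender": "", "amount": "3.25"},  # missing optional field
    {"cust_id": 4, "gender": "F", "amount": None},   # violates a constraint
]

# Assumed lookup used to resolve inconsistent codes into one convention.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def clean(row):
    """Return a cleaned tuple, or None if the row violates a constraint."""
    if row["amount"] is None:  # integrity constraint: amount is required
        return None
    gender = GENDER_MAP.get(row["gender"].strip().lower(), "UNKNOWN")
    return (row["cust_id"], gender, float(row["amount"]))


def load(rows):
    """Clean the rows and bulk-load the valid ones into a warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (cust_id INTEGER, gender TEXT, amount REAL)")
    cleaned = [c for c in map(clean, rows) if c is not None]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return conn


conn = load(SOURCE_ROWS)
loaded = conn.execute(
    "SELECT cust_id, gender, amount FROM sales ORDER BY cust_id"
).fetchall()
print(loaded)
```

Rejected rows are simply dropped here; a real pipeline would more likely log them or route them to a review table.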
The Data Warehousing Models
•Enterprise Warehouse
• collects all the information about subjects spanning the entire organization
•Data Mart
• a subset of corporate-wide data that is of value to a specific group of users
• its scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (sourced directly from the warehouse) data mart
•Virtual warehouse
• a set of views over operational databases
• only some summary views are materialized
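The difference between a plain view and a materialized summary in a virtual warehouse can be illustrated with SQLite. This is a sketch under assumptions: the `orders` table and both view names are hypothetical, and because SQLite has no materialized views, the materialized summary is emulated with a snapshot table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table.
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 10.0), (2, "north", 5.0), (3, "south", 8.0)],
)

# Virtual view: recomputed from the operational data on every query.
conn.execute(
    """CREATE VIEW sales_by_region AS
       SELECT region, SUM(amount) AS total FROM orders GROUP BY region"""
)

# Emulated materialized summary: a stored snapshot of the view's result.
conn.execute("CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region")

# A new operational row arrives after the snapshot was taken.
conn.execute("INSERT INTO orders VALUES (4, 'south', 2.0)")

virtual = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
materialized = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region"
).fetchall()
print(virtual)       # reflects the new row
print(materialized)  # stale until the snapshot is refreshed
```

The view always sees current operational data but pays the aggregation cost on each query; the snapshot is cheap to read but must be refreshed.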
Physical Structure of Data Warehouse
•The basic architectures for constructing a data warehouse are:
• Centralized
• Distributed
• Federated
• Tiered
•The data warehouse is distributed for load balancing, scalability and higher availability.
Physical Structure of Data Warehouse (Contd..)
•Federated: the logical data warehouse is only virtual.
•Tiered: the central data warehouse is physical. Local data marts on different tiers store copies or summarizations of the previous tier.
Data Processing Models
•There are two basic data processing models:
• OLTP (On-Line Transaction Processing)
• Describes processing at operational sites
• Aim is reliable and efficient processing of a large number of transactions and ensuring data consistency.
• OLAP (On-Line Analytical Processing)
• Describes processing at the warehouse
• Aim is efficient multidimensional processing of large data volumes.
OLTP vs. OLAP
                    OLTP                                                      OLAP
users               clerk, IT professional                                    knowledge worker
function            day-to-day operations                                     decision support
DB design           application-oriented                                      subject-oriented
data                current, up-to-date; detailed, flat relational; isolated  historical, summarized; multidimensional; integrated, consolidated
usage               repetitive                                                ad-hoc
access              read/write; index/hash on prim. key                       lots of scans
unit of work        short, simple transaction                                 complex query
# records accessed  tens                                                      millions
# users             thousands                                                 hundreds
DB size             100MB-GB                                                  100GB-TB
metric              transaction throughput                                    query throughput, response
OLAP
•Main goal: support ad-hoc but complex querying performed by business analysts
•Interactive process of creating, managing, analyzing and reporting on data
•Extends spreadsheet-like analysis to work with huge amounts of data in a data warehouse
•Data exploration and aggregation in various ways
•Typical applications include assessing the effectiveness of a marketing campaign, forecasting product sales, and spotting trends
OLAP (Contd..)
•Allows a sophisticated user to analyse data using complex, multi-dimensional views
•Places key performance indicators (measures) into context (dimensions)
• Measures are pre-aggregated
• Data retrieval is significantly faster
•The resulting cube is made available to business analysts, who can browse the data using a variety of tools for ad hoc, interactive analytical processing
OLAP Server Architectures
•Relational OLAP (ROLAP):
• Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
• Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
• Greater scalability
• Schema design: Star, Snowflake, Fact Constellation
•Multidimensional OLAP (MOLAP):
• Array-based multidimensional storage engine (sparse matrix techniques)
• Fast indexing to pre-computed summarized data
• Schema design: Cube
•Hybrid OLAP (HOLAP):
• User flexibility - low level: relational, high level: array
ROLAP
•Special schema design: snowflake
•Special indexes: bitmap, multi-table join
•Proven technology (relational models, DBMS)
• Tends to outperform specialized MDDB, especially on large data sets
•Products
• IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
Measures and Dimensions
•Measures: key performance indicators that you want to evaluate
• Typically numerical, including volume, sales and cost
• A rule of thumb: if a number makes business sense when aggregated, then it is a measure
• Examples
• Aggregate daily volume to month, quarter and year
• Aggregating telephone numbers would not make sense, so they are not measures
• Affects what should be stored in the data warehouse
Measures and Dimensions (Contd..)
•Dimensions: categories of data analysis
• Typical dimensions include product, time, region
• A rule of thumb: when a report is requested “by” something, that something is usually a dimension
• Example
• Sales report: view sales by month, by region
• Two dimensions needed are time and region
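The "sales by month, by region" report can be expressed as a query in which the dimensions appear in GROUP BY and the measure is aggregated. The sketch below runs the query through Python's sqlite3 module; the `sales` table, its columns and its rows are made up for illustration. The `phone` column is included only to show a numeric-looking value that is not a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (month TEXT, region TEXT, phone TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2024-01", "east", "555-0101", 100.0),
        ("2024-01", "west", "555-0102", 50.0),
        ("2024-01", "east", "555-0103", 25.0),
        ("2024-02", "east", "555-0101", 40.0),
    ],
)

# Dimensions (month, region) go in GROUP BY; the measure (amount) is summed.
# phone is never aggregated: adding phone numbers has no business meaning.
report = conn.execute(
    """SELECT month, region, SUM(amount)
       FROM sales
       GROUP BY month, region
       ORDER BY month, region"""
).fetchall()

for month, region, total in report:
    print(month, region, total)
```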
Conceptual Modeling
•Star schema.
•Snowflake schema.
•Fact constellations, or Galaxy schema.
Star Schema
•Fact table
•Dimension tables
•Measures
•A single fact table and, for each dimension, one dimension table
•Does not capture hierarchies directly
Example - Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
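A cut-down version of the star schema example can be built and queried in SQLite to show how the fact table joins directly to each dimension table. This is a sketch under assumptions: only the time and item dimensions are included, the column types are guesses, and all row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables (subset of the columns in the example).
    CREATE TABLE time (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);

    -- Fact table: foreign keys to the dimensions plus the measures.
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time(time_key),
        item_key INTEGER REFERENCES item(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );

    INSERT INTO time VALUES (1, 5, 1, 'Q1', 2024), (2, 9, 4, 'Q2', 2024);
    INSERT INTO item VALUES (10, 'laptop', 'Acme', 'computer'),
                            (11, 'phone', 'Bolt', 'mobile');
    INSERT INTO sales VALUES (1, 10, 2, 2000.0), (1, 11, 1, 500.0),
                             (2, 10, 1, 1000.0);
    """
)

# One join per dimension, then aggregate the measure by dimension attributes.
rows = conn.execute(
    """SELECT t.quarter, i.brand, SUM(s.dollars_sold)
       FROM sales AS s
       JOIN time AS t ON t.time_key = s.time_key
       JOIN item AS i ON i.item_key = s.item_key
       GROUP BY t.quarter, i.brand
       ORDER BY t.quarter, i.brand"""
).fetchall()
print(rows)
```

Every dimension is exactly one join away from the fact table, which is what keeps star-schema queries simple.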
Dimension Hierarchies
[Figure: dimension hierarchy in which store rolls up to sType, and store rolls up to city, which rolls up to region]
•snowflake schema
•constellations
Example of Snowflake Schema
•Represents the dimensional hierarchy directly by normalizing tables.
•Easy to maintain and saves storage.
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_street, country
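The effect of normalizing a dimension can be seen in the query: reaching an attribute of the city table takes one more join than in the star schema. The sketch below uses only the location and city tables from the example, with assumed column types and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The location dimension is normalized: city attributes live in city.
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES city(city_key));
    CREATE TABLE sales (location_key INTEGER REFERENCES location(location_key),
                        dollars_sold REAL);

    INSERT INTO city VALUES (1, 'Pune', 'India'), (2, 'Lyon', 'France');
    INSERT INTO location VALUES (100, 'MG Road', 1), (101, 'FC Road', 1),
                                (102, 'Rue A', 2);
    INSERT INTO sales VALUES (100, 10.0), (101, 20.0), (102, 5.0);
    """
)

# Two joins (sales -> location -> city) are needed to group by country.
rows = conn.execute(
    """SELECT c.country, SUM(s.dollars_sold)
       FROM sales AS s
       JOIN location AS l ON l.location_key = s.location_key
       JOIN city AS c ON c.city_key = l.city_key
       GROUP BY c.country
       ORDER BY c.country"""
).fetchall()
print(rows)
```

The country name is stored once per city rather than once per location, which is the storage saving; the extra join is the price.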
Example of Fact Constellation
•Multiple fact tables that share many dimension tables
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Aggregates
• Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result on the slide's example data: 81
Aggregates (Contd..)
• Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
• Add up amounts by day, product
In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
• Going from the first query to the second (adding prodId to the grouping) is a drill-down; going back to the coarser grouping is a rollup.
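The aggregate queries can be run end to end with Python's sqlite3 module. The SALE rows below are an assumption: the slide's data table did not survive, so sample values were chosen such that day 1 totals 81, the figure quoted with the first query. The final step checks that rolling up the fine-grained result reproduces the coarse one.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (prodId TEXT, date INTEGER, amt INTEGER)")
# Assumed sample rows (not from the slides), picked so day 1 sums to 81.
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?, ?)",
    [("p1", 1, 12), ("p2", 1, 11), ("p1", 1, 50), ("p2", 1, 8),
     ("p1", 2, 44), ("p1", 2, 4)],
)

# Total for a single day.
day1_total = conn.execute("SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# Coarse grouping: by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# Drill-down: by day and product.
by_day_product = conn.execute(
    """SELECT date, prodId, sum(amt) FROM SALE
       GROUP BY date, prodId ORDER BY date, prodId"""
).fetchall()

# Rollup: summing the drill-down result over prodId gives the by-day totals.
rolled_up = defaultdict(int)
for date, _prod, total in by_day_product:
    rolled_up[date] += total

print(day1_total)
print(by_day)
print(by_day_product)
print(sorted(rolled_up.items()) == by_day)
```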
Points to be noticed about ROLAP
•Defines complex, multi-dimensional data with a simple model
•Reduces the number of joins a query has to process
•Allows the data warehouse to evolve with relatively low maintenance
•Can contain both detailed and summarized data
•ROLAP is based on familiar, proven, and already selected technologies
•But: it relies on SQL for multi-dimensional manipulation and calculations
Thank You