wendy-jackson
ITEC 423 Data Warehousing and Data Mining
Lecture 2
What is a data warehouse?
“A data warehouse is a subject-oriented, integrated (consolidated), time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Subject-Oriented Data
The warehouse is organized around the major subjects of the enterprise rather than its major application areas.
It contains decision-support data rather than application-oriented data.
The focus of the design is providing users easy access to data so that current and future questions can be answered.
Application-Oriented vs. Subject-Oriented
  Application-oriented: customer invoicing, stock control
  Subject-oriented: customers, products, sales
Integrated or Consolidated Data
The warehouse integrates corporate-level, application-oriented data from different source systems; this data is often inconsistent or missing.
The integrated data source must be made consistent to present a unified view of the data to the users.
[Figure: Integrated Data]
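To make the idea of consistency concrete, here is a minimal sketch of an integration step. The two source extracts, their field names, and the code table are all invented for illustration; a real warehouse would take these mappings from its own source systems.

```python
from datetime import datetime

# Hypothetical code table: every source spelling maps to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}


def unify_gender(raw):
    """Return 'M' or 'F', or None when the source value is missing or unknown."""
    if raw is None:
        return None
    return GENDER_CODES.get(str(raw).strip().lower())


def unify_date(raw, source_format):
    """Convert a source-specific date string to ISO format (YYYY-MM-DD)."""
    return datetime.strptime(raw, source_format).date().isoformat()


# Two invented source extracts with different field names and encodings.
billing_rows = [{"cust": "C1", "sex": "male", "since": "03/15/2020"}]
crm_rows = [{"customer_no": "C2", "gender": 0, "joined": "2021-07-01"}]

# One unified layout for the warehouse, whatever the source looked like.
integrated = [
    {"customer_id": r["cust"],
     "gender": unify_gender(r["sex"]),
     "customer_since": unify_date(r["since"], "%m/%d/%Y")}
    for r in billing_rows
] + [
    {"customer_id": r["customer_no"],
     "gender": unify_gender(r["gender"]),
     "customer_since": unify_date(r["joined"], "%Y-%m-%d")}
    for r in crm_rows
]
```

After this step both customers share the same field names, gender codes, and date format, which is the unified view the users see.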
Time-Variant Data
Data in the warehouse is only accurate and valid at some point in time or over some time interval.
The warehouse contains slices of data across different periods of time.
  With these data slices, the user can view current and past reports.
  The data represents a series of snapshots.
Time-variance is also shown in the extended time that data is stored.
  The warehouse contains several years’ worth of data.
  Time is implicitly or explicitly associated with all data.
This is necessary to support trending, forecasting, and time-based performance reporting, such as current year versus previous year.
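A current-year versus previous-year report can be sketched from dated snapshots as follows. The snapshot rows and function names are hypothetical and only illustrate why every row carries a time element.

```python
from collections import defaultdict

# Hypothetical month-end snapshots: (snapshot_date, product, units_sold).
snapshots = [
    ("2022-11-30", "widget", 120),
    ("2022-12-31", "widget", 150),
    ("2023-11-30", "widget", 140),
    ("2023-12-31", "widget", 190),
]


def units_by_year(rows):
    """Total the units for each calendar year found in the snapshot dates."""
    totals = defaultdict(int)
    for snapshot_date, _product, units in rows:
        totals[int(snapshot_date[:4])] += units
    return dict(totals)


def year_over_year_change(rows, year):
    """Fractional change of `year` against the year before it."""
    totals = units_by_year(rows)
    previous = totals[year - 1]
    return (totals[year] - previous) / previous


yearly = units_by_year(snapshots)
change_2023 = year_over_year_change(snapshots, 2023)
```

Because the 2022 snapshots are still in the warehouse, the comparison needs no access to the operational systems.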
Non-Volatile Data
Data in the warehouse is not updated in real time but is refreshed from operational systems on a regular basis.
New data is always added as a supplement to the database, rather than as a replacement.
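One way to picture "supplement, not replacement" is an append-only load. The sketch below uses Python's built-in sqlite3 module; the table, column names, and balances are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_balance ("
    " load_date TEXT, account_id TEXT, balance REAL,"
    " PRIMARY KEY (load_date, account_id))"
)


def refresh(conn, load_date, operational_rows):
    """Append one periodic batch; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO account_balance (load_date, account_id, balance) VALUES (?, ?, ?)",
        [(load_date, account_id, balance) for account_id, balance in operational_rows],
    )
    conn.commit()


# Two periodic refreshes from a (hypothetical) operational system.
refresh(conn, "2024-01-31", [("A1", 100.0), ("A2", 50.0)])
refresh(conn, "2024-02-29", [("A1", 80.0), ("A2", 75.0)])

history = conn.execute(
    "SELECT load_date, balance FROM account_balance"
    " WHERE account_id = 'A1' ORDER BY load_date"
).fetchall()
```

The February load does not overwrite the January balance; both remain available as history.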
Data Granularity
Operational database: data is usually kept at the lowest level of detail.
  Units of sale are captured at the level of units of a product per transaction at the check-out counter.
  The quantity ordered is captured and stored at the level of units of a product per order received from the customer.
  If summary data is needed, the individual transactions are grouped.
Data warehouse: initial requests are for summary data to use for analysis.
  Example: total sale units of a product in an entire region.
  Progressively more detail may be required: a breakdown by states in the region, then sale units at individual stores.
Data granularity is the level of detail; the warehouse keeps data summarized at different levels of detail.
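The region, state, and store levels mentioned above can be sketched as one roll-up function applied at different depths. The transactions and the `roll_up` helper are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (region, state, store, product, units).
transactions = [
    ("West", "CA", "S1", "widget", 3),
    ("West", "CA", "S1", "widget", 2),
    ("West", "CA", "S2", "widget", 4),
    ("West", "OR", "S3", "widget", 1),
    ("East", "NY", "S4", "widget", 5),
]

# How many leading geographic columns each level of detail keeps.
LEVELS = {"region": 1, "state": 2, "store": 3}


def roll_up(rows, level):
    """Sum units per product at the requested geographic level of detail."""
    depth = LEVELS[level]
    totals = defaultdict(int)
    for row in rows:
        key = row[:depth] + (row[3],)  # geographic prefix plus product
        totals[key] += row[4]
    return dict(totals)


by_region = roll_up(transactions, "region")
by_state = roll_up(transactions, "state")
by_store = roll_up(transactions, "store")
```

Each call answers the same question at a finer grain, which is why a warehouse often keeps summaries at several levels.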
Typical Properties of a Data Warehouse
A data warehouse is housed on an enterprise mainframe server.
Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized as a read-only, restructured copy.
The data warehouse database is used for processing analytical applications and user queries.
OLTP vs. Warehousing
Organized by transactions vs. organized by particular subject
More users vs. fewer users
Accesses few records vs. entire table
Smaller database vs. large database
Normalized data structure vs. unnormalized
Continuous update vs. periodic update (load)
Data Warehouse Compared to OLTP

PROPERTY           OLTP                    DATA WAREHOUSE
ACTIVITIES         Processes               Analysis
RESPONSE TIME      Subseconds to seconds   Seconds to hours
OPERATIONS         DML                     Primarily read-only
NATURE OF DATA     30-60 days              Snapshots over time
DATA ORGANIZATION  By application          By subject, time
SIZE               Small to large          Large to very large
DATA SOURCES       Operational, internal   Operational, internal, external
USAGE CURVE        Predictable             Unpredictable
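The difference in access pattern can be sketched with two queries against the same small table. The table and data are invented, and in practice the two workloads would run on separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY, customer TEXT, product TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer, product, qty) VALUES (?, ?, ?)",
    [("alice", "widget", 2), ("bob", "gadget", 1),
     ("alice", "gadget", 4), ("carol", "widget", 3)],
)

# OLTP-style work: DML that touches a single record located by its key.
conn.execute("UPDATE orders SET qty = qty + 1 WHERE order_id = ?", (1,))
one_order = conn.execute(
    "SELECT customer, product, qty FROM orders WHERE order_id = ?", (1,)
).fetchone()

# Warehouse-style work: a read-only query that scans the entire table.
units_by_product = dict(
    conn.execute("SELECT product, SUM(qty) FROM orders GROUP BY product").fetchall()
)
```

The first pair of statements reads and writes one row; the last query reads every row and changes nothing.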
Data Warehouse or Data Mart
Data Warehouse Compared to Data Mart

PROPERTY             DATA WAREHOUSE     DATA MART
Scope                Enterprise         Department
Subjects             Multiple           Single-subject, line of business (LOB)
Data source          Many               Few
Size (typical)       100 GB to > 1 TB   < 100 GB
Implementation time  Months to years    Months
Which one to build first?
Data warehouse or data mart?
Top Down Approach
Build the overall, big, enterprise-wide data warehouse instead of a collection of fragmented islands of information.
The data warehouse is large and integrated.
It takes longer to build and has a high risk of failure.
If you do not have experienced professionals on your team, this approach could be hazardous.
It is difficult to sell this approach to senior management and sponsors; they are not likely to see results soon enough.
Pros and Cons of Top Down Approach
Advantages
  A truly corporate effort, an enterprise view of data
  Inherently architected, not a union of disparate data marts
  Single, central storage of data about the content
  Centralized rules and control
  May see quick results if implemented with iterations
Disadvantages
  Takes longer to build even with an iterative method
  High exposure to risk of failure
  Needs high level of cross-functional skills
  High outlay without proof of concept
Bottom Up Approach
Build departmental data marts one by one, based on priority.
The collection of data marts makes up the data warehouse.
Beware of data fragmentation: independent data marts are blind to the overall requirements of the entire organization.
Data marts contain:
  data at the lowest level of granularity
  summaries, depending on the needs for analysis
Data marts are joined or “unioned” together by conforming the dimensions.
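Joining marts through a conformed dimension can be sketched as below. The product dimension, the two fact lists, and the `drill_across` helper are hypothetical; the point is only that both marts use the same product keys, so their measures line up.

```python
# A conformed product dimension: the same keys and names in every data mart.
product_dim = {1: "widget", 2: "gadget"}

# Hypothetical fact rows from two separately built marts.
sales_mart = [(1, 100), (2, 40), (1, 20)]   # (product_key, units_sold)
shipments_mart = [(1, 90), (2, 40)]         # (product_key, units_shipped)


def total_by_key(fact_rows):
    """Sum a mart's measure per dimension key."""
    totals = {}
    for key, measure in fact_rows:
        totals[key] = totals.get(key, 0) + measure
    return totals


def drill_across(dimension, *marts):
    """Line up one total per mart for each member of the shared dimension."""
    per_mart = [total_by_key(rows) for rows in marts]
    return {
        name: tuple(totals.get(key, 0) for totals in per_mart)
        for key, name in dimension.items()
    }


report = drill_across(product_dim, sales_mart, shipments_mart)
```

If each mart had its own product codes, this combined report would first need a reconciliation step, which is the fragmentation risk noted above.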
Pros and Cons of Bottom Up Approach
Advantages
  Faster and easier implementation of manageable pieces
  Favorable return on investment and proof of concept
  Less risk of failure
  Inherently incremental; can schedule important data marts first
  Allows project team to learn and grow
Disadvantages
  Each data mart has its own narrow view of data
  Permeates redundant data in every data mart
  Perpetuates inconsistent and irreconcilable data
  Proliferates unmanageable interfaces
Architectural Types
Centralized Data Warehouse
  Takes into account the enterprise-level information requirements
  Atomic-level data at the lowest level of granularity is stored
  Some summarized data may be included
  Queries and applications access the central data warehouse
  No separate data marts
Independent Data Marts
  Evolves in companies where the organizational units develop their own data marts for their own specific purposes
  Each data mart serves a particular organizational unit
  More than one version of the truth may be found
  Data marts are independent of one another
  Different data marts may have inconsistent data definitions and standards
  Such variances hinder analysis of data across data marts
Federated
  An existing legacy of an assortment of decision-support systems (DSS) in the form of operational systems, extracted datasets, primitive data marts, …
  It may not be possible to discard the investment and start from scratch
  The practical solution is a federated architectural type
  Data may be physically or logically integrated through shared key fields, overall global metadata, distributed queries, and other such methods
  No one overall data warehouse
Data-Mart Bus
  Conformed supermarts approach
  Begin by analyzing requirements for a specific business subject such as orders, shipments, billings, insurance claims, car rentals, and ...
  Build the first data mart (supermart) using business dimensions and metrics
  These business dimensions will be shared in the future data marts
  Conform dimensions among the various data marts
  The result would be logically integrated supermarts that provide an enterprise view of the data
  Data marts contain atomic data organized as a dimensional data model
  Results from adopting an enhanced bottom-up approach to data warehouse development
Hub-and-Spoke
  Similar to the centralized data warehouse architecture
  Overall enterprise-wide data warehouse
  Atomic data is stored in the centralized data warehouse
  The major and useful difference is the presence of dependent data marts in this architectural type
  Dependent data marts obtain data from the centralized data warehouse
  The centralized data warehouse forms the hub that feeds data to the data marts on the spokes
  Dependent data marts may be developed for a variety of purposes: departmental analytical needs, specialized queries, data mining, and ...
  A dependent data mart may have normalized, denormalized, summarized, or dimensional data structures based on individual requirements
  Most queries are directed to the dependent data marts, although the centralized data warehouse may itself be used for querying
  Results from adopting a top-down approach to data warehouse development
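As a rough sketch of the hub feeding a spoke, the example below derives a summarized dependent mart from atomic rows in a central table. It uses sqlite3 with a single in-memory database standing in for both hub and mart; the table names, columns, and the `build_dependent_mart` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The hub: atomic rows in the centralized warehouse (invented data).
conn.execute(
    "CREATE TABLE warehouse_sales ("
    " sale_date TEXT, department TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [("2024-01-05", "marketing", "widget", 10.0),
     ("2024-01-06", "marketing", "widget", 15.0),
     ("2024-01-06", "marketing", "gadget", 5.0),
     ("2024-01-06", "finance", "gadget", 7.0)],
)


def build_dependent_mart(conn, department):
    """Create a summarized mart for one department from the hub's atomic data."""
    if not department.isidentifier():
        raise ValueError("department must be a simple identifier")
    table = f"mart_{department}"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (product TEXT PRIMARY KEY, total_amount REAL)")
    conn.execute(
        f"INSERT INTO {table} (product, total_amount)"
        " SELECT product, SUM(amount) FROM warehouse_sales"
        " WHERE department = ? GROUP BY product",
        (department,),
    )
    conn.commit()
    return table


mart = build_dependent_mart(conn, "marketing")
marketing_rows = conn.execute(
    f"SELECT product, total_amount FROM {mart} ORDER BY product"
).fetchall()
```

The mart holds only what one department needs, in summarized form, while the atomic rows stay in the hub.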
[Figure: Building Blocks of Data Warehouses]
What is OLAP?
Online analytical processing (OLAP) is an approach to answering multi-dimensional analytical queries.
It is part of the broader category of business intelligence, which also includes reporting and data mining.
Applications include business reporting for sales, management reporting, budgeting, and forecasting.
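A multi-dimensional query can be sketched as "pick the dimensions to group by, optionally fix a value on another dimension". The facts and the `aggregate` helper below are invented for illustration and stand in for what an OLAP tool does at much larger scale.

```python
from collections import defaultdict

# Hypothetical sales facts with three dimensions: year, region, product.
facts = [
    {"year": 2023, "region": "East", "product": "widget", "sales": 100},
    {"year": 2023, "region": "West", "product": "widget", "sales": 150},
    {"year": 2023, "region": "East", "product": "gadget", "sales": 70},
    {"year": 2024, "region": "East", "product": "widget", "sales": 120},
]


def aggregate(rows, group_by, where=None):
    """Sum sales over the chosen dimensions, optionally slicing on fixed values."""
    where = where or {}
    totals = defaultdict(int)
    for row in rows:
        if all(row[dim] == value for dim, value in where.items()):
            totals[tuple(row[dim] for dim in group_by)] += row["sales"]
    return dict(totals)


# Roll up over product: sales by year and region.
by_year_region = aggregate(facts, ("year", "region"))

# Slice on region = East, then view by product.
east_by_product = aggregate(facts, ("product",), where={"region": "East"})
```

Changing `group_by` or `where` gives a different view of the same facts without changing the data.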
Typical Data Warehousing Process
Phase I - STRATEGY: Identify business requirements; define objectives and purpose of the DW.
Phase II - DEFINITION: Project scoping and planning, using a building-block approach.
Phase III - ANALYSIS: Information requirements are defined.
Phase IV - DESIGN: Database structures to hold base data and summaries are created; translation mechanisms are designed.
Phase V - BUILD & DOCUMENT: The warehouse is built and documentation is developed.
Phase VI - POPULATE, TEST & TRAIN: The warehouse is populated and tested; the users are trained on the system and tools.
Phase VII - DISCOVERY & EVOLUTION: The warehouse is monitored and adjustments are applied, or future extensions are planned.
The process is iterative.
What Does All This Mean?
On a daily basis, organizations turn to their data warehouses to answer a limitless variety of questions.
Nothing is free: these benefits do come with a cost.
The value of a data warehouse is a result of the new and changed business processes it enables.
There are limitations: a DW cannot correct problems with the data, although it may help to clearly identify them.
Comparison of Typical DW Costs and Benefits
Costs
  Hardware, software, development personnel and consultant costs.
  Operational costs like ongoing systems maintenance.
Benefits
  Added revenue
    Will the new (business objective) process generate new customers (what is the estimated value?)
    Will the new (business objective) process increase the buying propensity of existing customers (by how much?)
    Is the new process necessary to ensure that the competition doesn't offer a demanded service that you can't match?
  Reduced costs
    What costs of current systems will be eliminated?
    Is the new process intended to make some operation more efficient? If so, how and what is the dollar value?
The Cost of Warehousing Data
Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs.
The initial costs can be further divided into hardware and software costs.
Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse).
Expenditures Associated with Building a DW
Capital costs
  Recurring: hardware maintenance, software maintenance, terminal analysis, middleware
  One-time, hardware: disk, CPU, network, terminal analysis
  One-time, software: DBMS, terminal analysis, middleware, network, log utility, processing, metadata infrastructure
Operational costs
  Recurring: ongoing refreshment, integration transformation, data model maintenance, record identification maintenance, metadata infrastructure maintenance, archival of data, data aging within the DW
  One-time: integration/transformation processing specification, metadata infrastructure population, system of record definition, data dictionary language definition, network transfer definition, CASE/repository interface, initial data warehouse population, data model definition, database design definition
Cost is Highly Variable
A company that spends less money for their data warehouse is often happier with it.
The main justification for the development expense is that a DW reduces the cost of accessing the information owned by the organization.
Since information has to be retrieved just once (when it is placed in the warehouse), DW users see a lower cost on each report generated.
Typical Multidatabase Report and Screen Generation
[Figure: Source System A, Source System B, Source System C, Source System D]
Data download and transformation contribute to retrieval costs for every report or screen generated.
Typical DW Report and Screen Generation
[Figure: Source System A, Source System B, Source System C, Source System D, Organizational Data Warehouse]
Data upload and transformation costs occur just once.
Retrieval costs are lower.
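The cost argument can be put into a small back-of-the-envelope model. The cost units and parameter names are invented, and the model simplifies by charging the warehouse one load per source; a real warehouse also pays for each periodic refresh.

```python
def cost_without_dw(reports, sources, extract_cost, query_cost):
    """Every report downloads and transforms data from every source system."""
    return reports * (sources * extract_cost + query_cost)


def cost_with_dw(reports, sources, extract_cost, query_cost):
    """Extraction and transformation are paid once; each report only queries."""
    return sources * extract_cost + reports * query_cost


# Hypothetical figures: 100 reports, 4 source systems,
# 10 cost units per extract/transform, 1 cost unit per query.
direct_total = cost_without_dw(reports=100, sources=4, extract_cost=10, query_cost=1)
warehouse_total = cost_with_dw(reports=100, sources=4, extract_cost=10, query_cost=1)
```

With these assumed figures the direct approach costs 4100 units and the warehouse approach 140, and the gap widens as the number of reports grows.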
Farmers and Explorers
Every corporation has two types of DW users.
Farmers know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.
Explorers are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless nuggets.
Cost justification for the DW is usually done on the basis of the results obtained by farmers, since explorers are unpredictable.
Data Marts and the Data Warehouse
[Figure: Legacy Systems; four Operational Data Stores; Organizational Data Warehouse; Finance, Accounting, Marketing, and Sales Data Marts]
Legacy systems feed data to the warehouse.
The warehouse feeds specialized information to departments.
The Data Mart is More Specialized
[Figure: Organizational Data Warehouse; Finance, Accounting, Marketing, and Sales Data Marts]
Data Marts
  Departmentalized
  Summarized, aggregated data
  Star join design
  Limited historical data
  Limited data volume
  Requirements driven data
  Focused on departmental needs
  Multi-dimensional DBMS technologies
Organizational Data Warehouse
  Corporate
  Highly granular data
  Normalized design
  Robust historical data
  Large data volume
  Data model driven data
  Versatile
  General purpose DBMS technologies
The data mart serves the needs of one business unit, not the organization.
Foundations of Data Mining
Data mining is the process of using raw data to infer important business relationships.
Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.
It is a collection of powerful techniques intended for analyzing large datasets.
There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.
The Roots of Data Mining
The approach has roots in practice dating back over 30 years.
In the early 1960s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.
By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.
A General Approach
Although all data mining endeavors are unique, they possess a common set of process steps:
1. Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools
2. Exploration – looking at summary data, sampling and applying intuition
3. Analysis – each discovered pattern is analyzed for significance and trends
A General Approach (continued)
4. Interpretation – Once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.
5. Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.
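As a small illustration of the exploration step (sampling and looking at summary data), the sketch below draws a reproducible random sample and summarizes it. The data, the `explore` helper, and the choice of statistics are assumptions; real exploration would use whatever mining tool was chosen during infrastructure preparation.

```python
import random
import statistics


def explore(values, sample_size, seed=0):
    """Draw a reproducible random sample and return basic summary statistics."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample) if len(sample) > 1 else 0.0,
        "min": min(sample),
        "max": max(sample),
    }


# Hypothetical purchase amounts standing in for a large dataset.
purchase_amounts = [float(amount) for amount in range(1, 1001)]
summary = explore(purchase_amounts, sample_size=50)
```

A summary like this is only a starting point; the analysis and interpretation steps decide whether anything seen in it is a meaningful pattern.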
Review Vocabulary
Data warehouse
Data mart
OLTP
OLAP
Dimensional model
Subject-oriented
Time-variant
Non-volatile
Integrated/consolidated