Data Warehouse – Part 01
Based on Chapter 06, The Data Warehouse, in Data Mining: A Tutorial-Based Primer by Roiger and Geatz
What’s the Problem with Data?
http://www.techrepublic.com/whitepapers/surviving-the-data-explosion-through-data-reduction/1125783?tag=content;siu-container
Why Not Just Use Operational Dbs?
Operational
  Transactional
  OLTP
  Transaction-oriented, i.e., designed for quick processing of an individual transaction, e.g., a purchase
Decision Support Systems
  For example, systems that support decisions through data mining
  Subject-oriented
Review – Process for Building an Operational Db
First Step: data modeling – create the entity relationship diagram (ERD)
The data model documents the structure of the data
There is no consideration for use in the data model
ERDs
http://www.sqlservercentral.com/articles/Miscellaneous/designadatabaseusinganentityrelationshipdiagram/1159/
http://www.umsl.edu/~sauterv/analysis/er/er_intro.html
Entity - Relationship
Entity
  Concept
  Represents a class of persons, places, things
  May have attributes
  Some combination of attributes can uniquely identify each instance of an entity
  Key
Relationship
  Between two entities
  One-to-one
    Husband-to-wife in US culture and society (at any one time)
  One-to-many
    Father-to-child
  Many-to-many
    Student-to-teacher
Sample of Credit Card Promotion Data (from Table 2.3)
Income Range  Magazine Promo  Watch Promo  Life Ins Promo  CC Ins  Sex     Age
40-50K        Yes             No           No              No      Male    45
30-40K        Yes             Yes          Yes             No      Female  40
40-50K        No              No           No              No      Male    42
30-40K        Yes             Yes          Yes             Yes     Male    43
50-60K        Yes             No           Yes             No      Female  38
20-30K        No              No           No              No      Female  55
30-40K        Yes             No           Yes             Yes     Male    35
20-30K        No              Yes          No              No      Male    27
30-40K        Yes             No           No              No      Male    43
30-40K        Yes             Yes          Yes             No      Female  41
e.g., What is the cardinality of the relationship customer-to-promotion?
Review – Process for Building an Operational Db
First Step: data modeling – create the entity relationship diagram (ERD)
The data model documents the structure of the data
There is no consideration for use in the data model
Second Step: normalization (db normalization, not mathematical normalization)
Normalization
Reduces duplication of data within tables
Result is more tables with fewer columns per table
Effective Normalization
Improves data integrity/validity by reducing data redundancy
Faster sorting of data
Queries run efficiently
Can Have Too Much Normalization
Too many relationships
Too many slim, small tables
To retrieve one piece of information requires access to many tables through many joins
Compromises performance
Compromised maintenance
Normalization
A formal process
First Normal Form (1NF)
Eliminate any repeating groups of information (a row-column intersection contains only one value, not a list)
No duplicate rows (a primary key can be assigned)
(Problem) Table: Employee
Employee_ID  Last_Name   Children
100          Patel       Babaraj, Salleh, Sara
110          Washington  Martha, Ted
120          Cortez      Sam, Jorge
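One way to remove the repeating group is to move the Children values into their own table, one child per row. The following is a minimal Python sketch of that idea; the helper name `to_first_normal_form` and the shape of the resulting child table (Employee_ID, Child) are assumptions made for illustration and do not come from the slides.

```python
# Problem table from the slide: the Children column holds a list of values.
employee = [
    (100, "Patel", "Babaraj, Salleh, Sara"),
    (110, "Washington", "Martha, Ted"),
    (120, "Cortez", "Sam, Jorge"),
]

def to_first_normal_form(rows):
    """Split the repeating Children group into a separate table (hypothetical helper)."""
    # Employee table keeps one row per employee, without the list column.
    employees = [(emp_id, last_name) for emp_id, last_name, _ in rows]
    # Child table holds exactly one child per row, tied back by Employee_ID.
    children = [
        (emp_id, child.strip())
        for emp_id, _, child_list in rows
        for child in child_list.split(",")
    ]
    return employees, children

employees_1nf, children_1nf = to_first_normal_form(employee)
for row in children_1nf:
    print(row)
```

Each row-column intersection now holds a single value, and (Employee_ID, Child) could serve as a key for the new table.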
Second Normal Form (2NF)
1NF plus
Each column within the table must depend on the whole primary key
(Problem) Table: Course_Details
Prefix  Course  Credits  College
CIS     3320    3        Technology
CIS     4380    3        Technology
CHEM    3505    5        NSM
MIS     3320    3        Business
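A possible 2NF decomposition of Course_Details is sketched below in Python. It rests on two assumptions the slide does not state outright: that the primary key is the composite (Prefix, Course), and that College depends on Prefix alone, which is only part of that key.

```python
# Problem table from the slide; the composite key is assumed to be (Prefix, Course).
course_details = [
    ("CIS", 3320, 3, "Technology"),
    ("CIS", 4380, 3, "Technology"),
    ("CHEM", 3505, 5, "NSM"),
    ("MIS", 3320, 3, "Business"),
]

# Assumption: College depends only on Prefix, i.e. on part of the key,
# so it moves to its own table keyed by Prefix.
course = [(prefix, number, credits) for prefix, number, credits, _ in course_details]
prefix_college = sorted({(prefix, college) for prefix, _, _, college in course_details})

print(course)
print(prefix_college)
```

After the split, the college for a prefix is stored once rather than once per course.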
Third Normal Form (3NF)
2NF plus
No column is dependent on any other column within the table that is not defined as a key.
No data is derived from other data within the table.
(Problem) Table: Course_Section
Section  Instructor  Office   Phone Number
12345    100-4       M 355    5-7011
12467    101-6       B 424    5-6322
15083    100-4       M 355    5-7011
16078    210-8       B 434    5-3321
56701    101-6       B 424    5-6322
12554    100-12      M 201 A  5-7337
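The 3NF problem in Course_Section can be illustrated with a short Python sketch. It assumes that Office and Phone Number are determined by Instructor (a non-key column) rather than by Section; the data on the slide is consistent with that reading, but the slide does not say so explicitly.

```python
# Problem table from the slide: Office and Phone Number repeat for each instructor.
course_section = [
    (12345, "100-4", "M 355", "5-7011"),
    (12467, "101-6", "B 424", "5-6322"),
    (15083, "100-4", "M 355", "5-7011"),
    (16078, "210-8", "B 434", "5-3321"),
    (56701, "101-6", "B 424", "5-6322"),
    (12554, "100-12", "M 201 A", "5-7337"),
]

# Assumption: Office and Phone Number depend on Instructor (a non-key column),
# not on Section, so they move to a table keyed by Instructor.
section = [(sec, instr) for sec, instr, _, _ in course_section]
instructor = {instr: (office, phone) for _, instr, office, phone in course_section}

print(section)
print(instructor)
```

An instructor's office or phone number now changes in one place instead of in every section row.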
Relational Model
Entities are realized as two-dimensional tables where the columns are the attributes of the entity and the rows are data instances of (examples of) the entity.
Relationships between entities are realized as a mapping from the primary key attribute set of one table to one or more columns of the related entity’s table.
Consider Transaction Orientation vs…
Table: Course_Section
Section  Instructor
12451    100-4
19372    101-7
10029    100-12
12452    100-4
Table: Grade_Record
Student  Section  Grade
1093456  12451    B
1184567  12452    B
2341100  10029    C
1972344  10029    D
Subject Orientation
Table: Grade_by_Instructor
Student  Grade  Section  Instructor
1093456  B      12451    100-4
1184567  B      12452    100-4
2341100  C      10029    100-12
1972344  D      10029    100-12
1093456  C      10029    100-12
1184567  C      10029    100-12
2341100  A      12451    100-4
1972344  B      12452    100-4
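A subject-oriented table such as Grade_by_Instructor can be derived from the transaction-oriented tables by joining them once and storing the wide result. The sketch below uses Python's built-in sqlite3 module with only the four Grade_Record rows shown on the transaction-orientation slide, so it reproduces just the first four rows of Grade_by_Instructor; the column types and key choices are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Course_Section (Section INTEGER PRIMARY KEY, Instructor TEXT);
    CREATE TABLE Grade_Record (Student INTEGER, Section INTEGER, Grade TEXT,
                               PRIMARY KEY (Student, Section));
""")
conn.executemany("INSERT INTO Course_Section VALUES (?, ?)",
                 [(12451, "100-4"), (19372, "101-7"), (10029, "100-12"), (12452, "100-4")])
conn.executemany("INSERT INTO Grade_Record VALUES (?, ?, ?)",
                 [(1093456, 12451, "B"), (1184567, 12452, "B"),
                  (2341100, 10029, "C"), (1972344, 10029, "D")])

# Denormalize: one wide row per grade, with the instructor repeated on each row.
grade_by_instructor = conn.execute("""
    SELECT g.Student, g.Grade, g.Section, c.Instructor
    FROM Grade_Record AS g
    JOIN Course_Section AS c ON c.Section = g.Section
    ORDER BY g.rowid
""").fetchall()

for row in grade_by_instructor:
    print(row)
```

The join cost is paid once at load time, so analysis queries by instructor no longer need it.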
Data Warehouse Design
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”*
*Inmon, W. H. (1996). Building the Data Warehouse. New York: John Wiley and Sons, Inc.
OLTP vs. Data Warehouse
Data Warehouse
  Subject-oriented
  Denormalized, integrated
  Stores data to be reported on, analyzed, tested
  Data is historical, no longer used in operations
  Data is static
  Granularity is a design issue
OLTP
  Process-oriented (or transaction-oriented)
  Normalized, separated
  Stores data to be processed, collected, managed
  Data is necessary for day-to-day operations of the business
  Data will be updated
  Granularity to the most detailed level
Sources of DW Data
External data
  Data not specific to the organization
  Economic indicators, weather
Operational data
  From the OLTP system
Independent data mart
  Like a data warehouse, only focuses on one subject
  Belongs to the organization – but maybe to a different department
ETL
Extract – Transform – Load
A routine whereby data is brought into the data warehouse from other sources
Transform
  Data cleaning
  Resolve granularity issues
  Correct data inconsistencies
  Time-stamp data records
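The four transform tasks can be sketched as a small Python pipeline. Everything here is hypothetical: the source rows, the function names `extract`, `transform`, and `load`, and the choice of daily store totals as the target granularity are assumptions for illustration, not part of the slides.

```python
from datetime import datetime, timezone

# Hypothetical operational rows: (sale_date, store, amount); not data from the slides.
source_rows = [
    ("2024-01-05", " north ", "19.99"),
    ("2024-01-05", "North", "5.01"),
    ("2024-01-06", "SOUTH", "7.50"),
    ("2024-01-06", "south", None),  # incomplete record
]

def extract():
    """Stand-in for reading from an OLTP system."""
    return list(source_rows)

def transform(rows, loaded_at):
    """Clean, reconcile, roll up to one row per (date, store), and time-stamp."""
    totals = {}
    for sale_date, store, amount in rows:
        if amount is None:                 # data cleaning: drop incomplete records
            continue
        store = store.strip().title()      # correct inconsistent spellings
        key = (sale_date, store)           # resolve granularity: daily store totals
        totals[key] = totals.get(key, 0.0) + float(amount)
    # Time-stamp every record with the load time.
    return [(d, s, round(t, 2), loaded_at) for (d, s), t in sorted(totals.items())]

def load(warehouse, rows):
    """Append-only load into a list standing in for the warehouse."""
    warehouse.extend(rows)

warehouse = []
load(warehouse, transform(extract(), datetime.now(timezone.utc).isoformat()))
for row in warehouse:
    print(row)
```

The load step only appends, which matches the idea that warehouse data is normally not updated after it arrives.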
Data in a DW is Static
Once data is in the data warehouse it is read-only
Not always true
Data Warehouse Format
Multidimensional array of data – not based on relational model
Star schema – based on relational model
Star Schema
http://www.executionmih.com/data-warehouse/star-snowflake-schema.php
Fact Table
Defines the dimensions of the multi-dimensional space being created
Each record in a fact table contains two types of data
  Facts
  Dimension keys
Fact table key is a composite key made of keys for each dimension table
Dimension Table
Data specific to a dimension
One-to-many relation from dimension table to fact table
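A star schema along these lines can be sketched with Python's built-in sqlite3 module. The table and column names (`dim_date`, `dim_product`, `fact_sales`, `sales_amount`) and all rows are hypothetical; only the structure (fact rows holding facts plus dimension keys, a composite fact key, one-to-many from dimension to fact) follows the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: data specific to one dimension each (hypothetical names).
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

    -- Fact table: dimension keys plus the fact (sales_amount).
    -- Its key is a composite of the dimension keys.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date (date_key),
        product_key  INTEGER REFERENCES dim_product (product_key),
        sales_amount REAL,
        PRIMARY KEY (date_key, product_key)
    );

    INSERT INTO dim_date VALUES (1, 'Jan', 2024), (2, 'Feb', 2024);
    INSERT INTO dim_product VALUES (10, 'Watch'), (20, 'Magazine');
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 20, 15.0), (2, 10, 80.0);
""")

# One dimension row relates to many fact rows (one-to-many).
sales_by_product = conn.execute("""
    SELECT p.name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(sales_by_product)
```

Every analysis query has the same shape: join the fact table to whichever dimensions are of interest, then aggregate the fact.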
Multidimensional Database
Definition
“A multidimensional database is structured around measures, dimensions, hierarchies, and cubes rather than tables, rows, columns, and relations.”
Larson, B. (2008). Delivering Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.
http://gerardnico.com/wiki/database/database_multidimensional
[Data] Cube
Definition
“A cube is a structure that contains a value for one or more measures for each unique combination of the members of all its dimensions. These are detail, or leaf-level values. The cube also contains aggregated values formed by the dimension hierarchies or when one or more of the dimensions is left out of the hierarchy.”
Larson, B. (2008). Delivering Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.
A Point Within a Cube is a Value of the Measure (the Fact)
The intersection of all dimensions is a point
That point represents a value of the measure for the particular unique combination of dimension values
The point is called a detail or leaf-level value
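A cube can be pictured as a mapping from one member of every dimension to a measure value. The Python sketch below is an illustration under assumed names and data (the dimensions year, product, region, the sales figures, and the helper `aggregate` are all hypothetical); it shows a leaf-level lookup and an aggregate formed by leaving a dimension out.

```python
from collections import defaultdict

# Hypothetical leaf-level data: (year, product, region) -> sales measure.
leaf_values = {
    (2023, "Watch", "East"): 10.0,
    (2023, "Watch", "West"): 5.0,
    (2023, "Magazine", "East"): 2.0,
    (2024, "Watch", "East"): 7.0,
}
DIMENSIONS = ("year", "product", "region")

def aggregate(cells, keep):
    """Sum the measure over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for members, value in cells.items():
        totals[tuple(members[i] for i in positions)] += value
    return dict(totals)

# A point: the intersection of all three dimensions gives one leaf-level value.
print(leaf_values[(2023, "Watch", "East")])

# Aggregated values: the region dimension is left out.
print(aggregate(leaf_values, keep=("year", "product")))
```

A real multidimensional database would typically precompute or index such aggregates rather than summing on demand.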
Snowflake Schema
Dimension tables are normalized – hence they are broken down into two or more tables
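As a rough illustration, the sqlite3 sketch below takes a flat product dimension and breaks it into two normalized tables. The table names, columns, and rows are hypothetical and chosen only to show the shape of the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star-style dimension (hypothetical): category name repeated per product.
    CREATE TABLE dim_product_flat (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    INSERT INTO dim_product_flat VALUES
        (10, 'Watch', 'Accessories'), (11, 'Bracelet', 'Accessories'), (20, 'Magazine', 'Print');

    -- Snowflake-style: the same dimension broken into two normalized tables.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT UNIQUE);
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        name         TEXT,
        category_key INTEGER REFERENCES dim_category (category_key)
    );
    INSERT INTO dim_category (category)
        SELECT DISTINCT category FROM dim_product_flat ORDER BY category;
    INSERT INTO dim_product
        SELECT f.product_key, f.name, c.category_key
        FROM dim_product_flat AS f JOIN dim_category AS c ON c.category = f.category;
""")

print(conn.execute("SELECT * FROM dim_category ORDER BY category_key").fetchall())
print(conn.execute("SELECT * FROM dim_product ORDER BY product_key").fetchall())
```

The trade-off is the one noted for normalization in general: less repeated data in the dimension, at the cost of an extra join when querying.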
Constellation Schema
More than one fact table