
CISB594 – Business Intelligence Data Warehousing Part I


Page 1: CISB594 – Business Intelligence Data Warehousing Part I

CISB594 – Business Intelligence

Data Warehousing Part I

Page 2: CISB594 – Business Intelligence Data Warehousing Part I

Reference
• Materials used in this presentation are extracted mainly from the following texts, unless stated otherwise.

Page 3: CISB594 – Business Intelligence Data Warehousing Part I

Additional Reference

“Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber

Page 4: CISB594 – Business Intelligence Data Warehousing Part I

Objectives

At the end of this lecture, you should be able to:
• Understand the basic definitions and concepts of data warehouses
• Understand how a data warehouse differs from a database
• Describe the characteristics of a data warehouse
• Describe the data warehouse process overview
• Describe the different types of data warehouse architectures

Page 5: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehouse
• “The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is non-volatile and relevant to some moment in time”
• A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis. (Wikipedia)

Page 6: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehouse

• A decision support database that is maintained separately from the organization’s operational database

• Supports information processing by providing a solid platform of consolidated, historical data for analysis

• In your own words?

Page 7: CISB594 – Business Intelligence Data Warehousing Part I

4 main characteristics of data warehousing

1. Subject oriented
• Organized around major subjects, such as customer, product, and sales, containing only information relevant for decision support, unlike operational databases, which are product oriented
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Page 8: CISB594 – Business Intelligence Data Warehousing Part I

4 main characteristics of data warehousing

2. Integrated
• Constructed by integrating multiple, heterogeneous data sources
• Must place data from different sources into a consistent format; to do so, it must deal with naming conflicts and discrepancies
• Data cleaning and data integration techniques are applied
• Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
• When data is moved to the warehouse, it is converted
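
The conversion step can be illustrated with a minimal sketch. It assumes two hypothetical source systems that name the customer key differently and encode gender differently; the field names (`cust_id`, `customer_no`, `gender`, `sex`) and the code mappings are illustrative assumptions, not taken from the lecture.

```python
# Hypothetical records from two source systems with conflicting conventions.
crm_rows = [{"cust_id": 1, "gender": "M"}, {"cust_id": 2, "gender": "F"}]
billing_rows = [{"customer_no": 3, "sex": 1}, {"customer_no": 4, "sex": 0}]

# Assumed warehouse convention: key "customer_id", gender "male"/"female".
CRM_GENDER = {"M": "male", "F": "female"}
BILLING_GENDER = {1: "male", 0: "female"}


def from_crm(row):
    """Convert a CRM row to the warehouse naming and encoding."""
    return {"customer_id": row["cust_id"], "gender": CRM_GENDER[row["gender"]]}


def from_billing(row):
    """Convert a billing row to the warehouse naming and encoding."""
    return {"customer_id": row["customer_no"], "gender": BILLING_GENDER[row["sex"]]}


def integrate(crm, billing):
    """Return all rows in one consistent format, ordered by customer_id."""
    rows = [from_crm(r) for r in crm] + [from_billing(r) for r in billing]
    return sorted(rows, key=lambda r: r["customer_id"])


warehouse_rows = integrate(crm_rows, billing_rows)
```

After integration every row uses the same key name and the same encoding, whichever source it came from.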

Page 9: CISB594 – Business Intelligence Data Warehousing Part I

4 main characteristics of data warehousing

3. Time variant (time series)
• Maintains historical data; data for analysis from multiple sources contain multiple time points
• The time horizon for the data warehouse is significantly longer than that of operational systems
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
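
The contrast between current-value data and historical data can be sketched as follows. This is an illustrative toy, assuming a hypothetical product price that changes over time; the names `record_price` and `price_as_of` are made up for the example.

```python
from datetime import date

# Operational view: one current value per product (overwritten on change).
operational_price = {}

# Warehouse view: every value is kept with the date it took effect.
warehouse_price_history = []


def record_price(product, price, effective):
    """Apply a price change to both stores."""
    operational_price[product] = price  # keeps the current value only
    warehouse_price_history.append((product, effective, price))  # keeps history


def price_as_of(product, day):
    """Return the warehouse price in effect on the given day, or None."""
    candidates = [
        (eff, price)
        for (prod, eff, price) in warehouse_price_history
        if prod == product and eff <= day
    ]
    return max(candidates)[1] if candidates else None


record_price("widget", 10.0, date(2020, 1, 1))
record_price("widget", 12.5, date(2022, 6, 1))
```

Only the warehouse-style history can answer a question such as "what was the price in March 2021?"; the operational store knows just the latest value.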

Page 10: CISB594 – Business Intelligence Data Warehousing Part I

4 main characteristics of data warehousing

4. Non-volatile
• After data are entered into a data warehouse, users cannot change or update the data
• A physically separate store of data from the operational environment
• Operational update of data does not occur in the data warehouse environment
• Does not require transaction processing, recovery, and concurrency control mechanisms
• Requires only two operations in data accessing:
– Initial loading of data and access of data
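
A minimal sketch of a store that offers only the two operations named above, loading and access. The class name `ReadOnlyWarehouse` and the row layout are assumptions made for illustration; a real warehouse enforces this through its load process and permissions rather than a Python class.

```python
class ReadOnlyWarehouse:
    """Toy store exposing only two operations: load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append a batch of rows; stored copies are immutable tuples."""
        self._rows.extend(tuple(r) for r in rows)

    def query(self, predicate=lambda row: True):
        """Return matching rows as a new list; the store itself is untouched."""
        return [row for row in self._rows if predicate(row)]


dw = ReadOnlyWarehouse()
dw.load([("2023-01", "north", 100), ("2023-01", "south", 80)])
dw.load([("2023-02", "north", 120)])
north = dw.query(lambda row: row[1] == "north")
```

There is deliberately no update or delete method: new data arrive only as further loads, and earlier rows stay as they were.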

Page 11: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
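
The difference in access patterns can be sketched with Python's built-in `sqlite3` module. For brevity a single in-memory table stands in for both systems, which is a simplification: in practice the warehouse is kept separate from the operational database. The `orders` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "north", 2022, 100.0),
        (2, "south", 2022, 50.0),
        (3, "north", 2023, 70.0),
        (4, "south", 2023, 30.0),
    ],
)

# OLTP-style access: a short transaction touching one row by primary key.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (120.0, 1))

# OLAP-style access: a read-only query that scans and consolidates many rows.
totals_by_region = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
)
```

The update touches a single record, while the aggregate reads every row in the table to produce a consolidated view.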

Page 12: CISB594 – Business Intelligence Data Warehousing Part I

OLAP

• Online Analytical Processing (OLAP) is an industry-accepted reporting technology that provides high-performance analysis and easy reporting on large volumes of data

• The goal of OLAP, also known as multidimensional data analysis, is to provide fast and flexible data summarization, analysis, and reporting capabilities with the ability to view trends over time

Page 13: CISB594 – Business Intelligence Data Warehousing Part I

OLAP

• OLAP applications, also called decision support systems (DSS), have the following features:
– Enable users to look at different relationships in data by looking beyond traditional two-dimensional row and column data analysis
– Offer high-performance access to large amounts of presummarized data
– Give users the power to retrieve answers to multidimensional business questions quickly and easily
– Provide slice-and-dice views of multiple relationships in large quantities of presummarized data
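
Slice-and-dice and presummarization can be sketched on a tiny in-memory "cube". The fact rows, the dimension names, and the helper functions `slice_cube`, `dice_cube`, and `summarize` are all hypothetical; real OLAP servers precompute and index such summaries rather than scanning a list.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
facts = [
    (2022, "north", "A", 10),
    (2022, "north", "B", 5),
    (2022, "south", "A", 7),
    (2023, "north", "A", 12),
    (2023, "south", "B", 9),
]
DIMS = {"year": 0, "region": 1, "product": 2}


def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose dimensions fall in the given sets of values."""
    return [
        r for r in rows
        if all(r[DIMS[d]] in vals for d, vals in allowed.items())
    ]


def summarize(rows, *dims):
    """Total the sales measure over the chosen dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[DIMS[d]] for d in dims)] += r[3]
    return dict(totals)


sales_2022_by_region = summarize(slice_cube(facts, "year", 2022), "region")
north_a = dice_cube(facts, region={"north"}, product={"A"})
```

The same facts can be summarized along any combination of dimensions, which is what takes the analysis beyond a single two-dimensional row-and-column view.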

Page 14: CISB594 – Business Intelligence Data Warehousing Part I

OLTP
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

Page 15: CISB594 – Business Intelligence Data Warehousing Part I

OLTP vs OLAP

                   | OLTP                                                     | OLAP
Users              | Clerk, IT professional                                   | Knowledge worker
Function           | Day-to-day operations                                    | Decision support
DB Design          | Application-oriented                                     | Subject-oriented
Data               | Current, up-to-date, detailed, flat relational, isolated | Historical, summarized, multidimensional, integrated, consolidated
Usage              | Repetitive                                               | Ad hoc
Access             | Read/write, index/hash on primary key                    | Lots of scans
Unit of Work       | Short, simple transaction                                | Complex query
# Records Accessed | Tens                                                     | Millions
# Users            | Thousands                                                | Hundreds
DB Size            | 100MB-GB                                                 | 100GB-TB
Metric             | Transaction throughput                                   | Query throughput, response

Page 16: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing - Concept
Data mart
– Smaller and focuses on a particular subject or department
– It is a subset of a data warehouse/departmental data warehouse
– A data mart is a smaller DW designed around one problem, organizational function, topic, or other focus area
Can be a dependent data mart
– A subset that is created directly from a data warehouse
– Ensures that the end user is viewing the same version of the data that are accessed by all other data warehouse users
Or an independent data mart
– A small data warehouse designed for a strategic business unit or a department

Page 17: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing - Concept
• Enterprise data warehouse (EDW)
– A large-scale data warehouse used across the enterprise for decision support
– Used to provide data for many types of DSS, including CRM, supply chain management, BPM, KMS, etc.
• Metadata
– Data about data. In a data warehouse, metadata describe the contents of a data warehouse and the manner of its use

Page 18: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing Process Overview

• Organizations continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems

• The number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet

• Creating a data warehouse involves the following:
– Data are imported from various internal and external resources
– Data are cleansed and organized to suit the organization’s needs
– Data marts can be loaded for a specific department or area (alternatively, data marts are created first and later integrated into the EDW)

Page 19: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing Process Overview

The data warehousing process consists of the following steps:
1. Data are imported from various internal and external sources
2. Data are cleansed and organized consistently with the organization’s needs
3a. Data are loaded into the enterprise data warehouse
4a. If desired, data marts are created as subsets of the EDW
or
3b. Data are loaded into data marts
4b. The data marts are consolidated into the EDW
5. Analyses are performed as needed
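
The steps above can be sketched as a small pipeline. This follows the 3a/4a path (EDW first, then a dependent data mart); the function names, the record fields `dept` and `amount`, and the cleansing rules are assumptions made for illustration only.

```python
def extract(sources):
    """Step 1: import rows from every internal and external source."""
    return [row for source in sources for row in source]


def cleanse(rows):
    """Step 2: drop incomplete rows and normalize names and types."""
    return [
        {"dept": r["dept"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
        if r.get("dept") and r.get("amount") is not None
    ]


def load_edw(rows):
    """Step 3a: load the cleansed rows into the enterprise data warehouse."""
    return list(rows)


def build_mart(edw, dept):
    """Step 4a: create a data mart as a subset of the EDW."""
    return [r for r in edw if r["dept"] == dept]


def analyze(rows):
    """Step 5: a trivial analysis, the total amount."""
    return sum(r["amount"] for r in rows)


# Hypothetical source data with inconsistent formatting and one bad row.
internal = [{"dept": " Sales ", "amount": "100"}, {"dept": "HR", "amount": 40}]
external = [{"dept": "sales", "amount": 25.5}, {"dept": None, "amount": 3}]

edw = load_edw(cleanse(extract([internal, external])))
sales_mart = build_mart(edw, "sales")
sales_total = analyze(sales_mart)
```

The alternative 3b/4b path would reverse the last two loading steps: build each mart from the cleansed rows first, then consolidate the marts into the EDW.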

Page 20: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing - Process Overview

The major components of a data warehousing process
• Data sources. Data are sourced from operational systems and possibly from external data sources.
• Data extraction. Data are extracted using custom-written or commercial software called ETL.
• Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse.
• Comprehensive database. This is the EDW that supports decision analysis by providing relevant summarized and detailed information.
• Metadata. Metadata are maintained for access by IT personnel and users. Metadata include rules for organizing data summaries that are easy to index and search.
• Middleware tools. Middleware tools enable access to the data warehouse from a variety of front-end applications.

Page 21: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing - Process Overview

Page 22: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing Architectures
• There are several basic architectures for data warehousing
• To distinguish the architectures, a data warehouse is divided into three parts:
– The data warehouse itself
– Data acquisition (back-end) software, which extracts data from legacy systems and external sources, consolidates it, and loads it into the data warehouse
– Client (front-end) software, which allows users to access and analyze data from the warehouse

Page 23: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing Architectures

Page 24: CISB594 – Business Intelligence Data Warehousing Part I

Data Warehousing Architectures

Ten factors that potentially affect the architecture selection decision:
1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors

Page 25: CISB594 – Business Intelligence Data Warehousing Part I

Now ask if ..

You are able to:
• Understand the basic definitions and concepts of data warehouses
• Understand how a data warehouse differs from a database
• Describe the characteristics of a data warehouse
• Describe the data warehouse process overview