

University of Jeddah, Jeddah - KSA
Faculty of Computing & IT – IS Dept.

Data Warehousing & Mining

Prof. Jamel FEKI [email protected]

Course Objective

The objective of this course is to study the basic concepts of data warehousing and the skills required to develop and use data warehouses. It emphasizes the use of data warehousing to support the decision-making process. It also covers data warehouse architectures and the infrastructure needed to build them, and explains various ways of extracting and analyzing data to support decision making. The course is intended to develop the student's ability to extract information from data and to identify patterns and trends by designing a data warehouse and by applying data mining methods for classification, clustering, and association analysis.

UoJ, FCIT, IS Dept. CPIS-342 DW & M

Course Outline

PART I: Data Warehousing
• Chapter 1: Introduction to Data Warehousing & Mining
• Chapter 2: Drawbacks of Transactional Systems
• Chapter 3: Multidimensional Modeling
• Chapter 4: OLAP Algebra
• Chapter 5: Multidimensional Constraints

PART II: Data Mining
• Chapter 6: KDD Process
• Chapter 7: Data Mining Techniques
• Chapter 8: Association Rules Discovery
• Chapter 9: Automatic Classification
• Chapter 10: Decision Trees
• Chapter 11: Neural Networks

Bibliography: Non-exhaustive list

• Using the Data Warehouse. W. H. Inmon, R. D. Hackathorn. Wiley Computer Publishing.
• An overview of data warehousing and OLAP technology. S. Chaudhuri, U. Dayal. ACM SIGMOD Record, 26:65-74. March 1997.
• Pirahesh. Data cube: a relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54. 1997.
• Building the data warehouse. W. H. Inmon. John Wiley. 1992.
• Fundamentals of data warehouses. M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis. Springer-Verlag. 2000.
• OLAP solutions: building multidimensional information systems. E. Thomsen. John Wiley & Sons. 1997.

Internet links

http://www.1keydata.com/datawarehousing/links.html

• The Data Warehousing Information Center: Provides information on tools and techniques to design, build, maintain, and retrieve information from a data warehouse. http://www.dwinfocenter.org/
• The Data Warehousing Institute: Provider of in-depth conferences, education, and training in the data warehousing and business intelligence industry. http://tdwi.org/Home.aspx
• ITtoolbox Portal for Data Warehousing: Content, community, and service for data warehousing professionals. Provides technical discussion, job postings, an integrated directory, news, and much more. http://datawarehouse.ittoolbox.com/
• Data Warehousing: Wilson Mar's data warehousing site. http://www.wilsonmar.com/1datawh.htm
• Evaltech: Useful site for data warehousing tool selection. http://www.evaltech.com/
• ETL Tools Info: Provides information about different business intelligence aspects, especially focusing on the Datastage ETL tool. http://www.1keydata.com/datawarehousing/links.html

Chapter 1

Introduction to Data Warehousing & Mining

Basic Issues

Nowadays, companies accumulate large volumes of data every day. More and more data is generated:
• Banks, telecommunications, commercial domains ...
• Scientific data: astronomy, biology, etc.
• Web: texts, images, video, etc.
• E-commerce

Basic Issues

Some concrete examples:
• The European VLBI (Very Long Baseline Interferometry) network has 16 telescopes, each of which produces 1 GB of astronomical data per second.
  → Storing and analyzing this data is a serious problem.
• AT&T, the leading telephone provider in the United States, manages billions of calls per day.
  → Storing this data is a hard problem.
  → Analyzing these calls in real time is very difficult.

Basic Issues

Some concrete numbers:
• Commercial databases (Winter Corp. 2003 survey):
  • AT&T ~ 26 TB (1 terabyte = 1024 GB)
  • France Telecom ~ 30 TB
• Web:
  • Alexa internet archive (www.alexa.com): 7 years of data ~ 500 TB
  • Google: searches over more than 4 billion pages ~ several hundred TB
  • IBM WebFountain (2003) ~ 160 TB
  • Internet Archive (www.archive.org) ~ 300 TB
• UC Berkeley (2003):
  • 5 EB (5 million TB) of data was created worldwide in 2002
  • About 40% of this data was produced by the United States
  • www.sims.berkeley.edu/research/projects/how-much-info-2003/
• IDC study (2007):
  • 161 EB (161 million TB) of data was created worldwide in 2006
  • In 2010: more than 988 EB of data (forecast)
  • www.usatoday.com/tech/news/2007-03-05-data_N.htm

Basic Issues

Automatic data collection tools fill databases with enormous quantities of data.

The development of computers, along with their decreasing cost, has allowed many organizations to build large datasets cheaply.

[Kodratoff 1997] estimates that the amount of data in the world doubles every twenty months.
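As a rough worked example of the growth figures quoted in these slides, the sketch below turns a doubling period into a projection and derives the doubling period implied by the two IDC figures (161 EB in 2006, 988 EB in 2010). The function names are illustrative, and constant exponential growth is an assumption of this sketch, not a claim made by the cited studies.

```python
import math

def projected_volume(v0, months, doubling_months=20.0):
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return v0 * 2 ** (months / doubling_months)

def implied_doubling_months(v_start, v_end, months):
    """Doubling period implied by two observations, assuming constant exponential growth."""
    return months * math.log(2) / math.log(v_end / v_start)

# Doubling every 20 months means an 8-fold increase after 5 years (60 months).
print(projected_volume(1.0, 60))

# IDC figures from the slides: 161 EB (2006) and 988 EB (2010), 48 months apart.
print(round(implied_doubling_months(161, 988, 48), 1))
```

Under this assumption, the IDC figures imply a doubling period between 18 and 19 months, which is of the same order as the twenty months quoted above.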

Basic Issues

Unfortunately, this tremendous amount of data is often under-exploited.

These large volumes of data need to be exploited:
• Is it possible to extract value from this data?
• Is it possible to use this data in the decision-making process, or to clarify choices for the organization?

Which data is useful?
• The explanation of some facts (realities) is hidden within the data.
• Data may help in understanding complex phenomena.

Too much data, but too little knowledge!

→ Solution: Knowledge Discovery from Data[base] (KDD)

Basic Issues

Additionally, new motivations have arisen since 1990 due to globalization:
• More serious competition
• Survival of the organization
• Taking advantage (profit) of the past (data)
• Foreseeing and planning for the future (strategic decisions)

New context (environment):
• New type of users: decision makers
• New requirements: analytics and knowledge extraction
• New software tools
• Additional database features

The KDD Process

• Data mining: the core of knowledge discovery
• Data warehousing: a prior step to data mining

[Figure: the KDD process. Stages shown: Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation.]

KDD & Business Intelligence

[Figure: pyramid of layers with increasing potential to support business decisions, listed here from bottom to top.]
• Data Sources: paper, files, Web documents, scientific experiments, database systems
• Data Preprocessing/Integration, Data Warehouses
• Data Exploration: statistical summary, querying, and reporting
• Data Mining: information discovery
• Data Presentation: visualization techniques
• Decision Making

Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

KDD: Confluence (1) of Multiple Disciplines

[Figure: KDD at the junction of several disciplines: Statistics, Algorithm, Pattern Recognition, Database Technology, Visualization, Data Warehousing (DW), Data Mining (DM), and other disciplines. Additional labels: Modeling, OLAPing.]

This course introduces two of these disciplines: DW & DM

(1) Confluence: junction, meeting point.

What are they?

Data Mining, Fayyad et al. (1997):
Data Mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data.

Data Warehousing:
Data Warehousing is the step of the KDD process that consists of designing, implementing and using a data warehouse to support the decision-making process.

A Data Warehouse is a special database built by integrating data from heterogeneous data sources, and used for On-Line Analytical Processing (OLAP) and Data Mining.

Why Not Traditional System (OLTP)?

A traditional system is also called:
• Transactional system, or operational system
• OLTP (On-Line Transaction Processing)

OLTP is designed to satisfy the requirements of IS users:
• They need to automate the daily activities of the organization
• They are not interested in the decision-making process

Consequently, OLTP is limited to five operational functions: data input, storage, processing, querying, and distribution.

OLTP ignores the needs of decision makers (e.g., analysis).

Why Not Traditional Software (DBMS...)?

DBMSs are designed around transactions.

• Data:
  • Are vital for the organization: they are used in transactions
  • Track its daily activities (business transactions)
  • Are intended for the operational system
• Processing is done through transactions (*), characterized by:
  • Concurrent access
  • Short processing time
  • Few data manipulated (accessed and/or updated)

(*) Transaction properties are known as ACID (Atomicity, Consistency, Isolation, Durability)
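To make the atomicity and consistency properties concrete, here is a minimal sketch using Python's standard sqlite3 module. The `account` table, its CHECK constraint and the `transfer` helper are hypothetical names chosen for illustration; they do not come from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

print(transfer(conn, 1, 2, 30), balances(conn))   # a valid transfer
print(transfer(conn, 1, 2, 500), balances(conn))  # overdraft: both updates are undone
```

The second call credits one account before the debit violates the constraint; the rollback undoes the credit as well, so the database never shows a half-completed transfer. This is the kind of short, small-footprint processing that a transactional DBMS is optimized for.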

Observations

Positive aspects:
• Maturity of OLTP
  • DBMS
  • CASE tools
  • 4GL
• OLTP is comprehensive enough
• Relational culture

Observations

Weaknesses (consequences of the OLTP objectives):
• OLTP systems are designed for the operational subsystem
  • Data are stored at a very high level of detail
  • They manage transactions
  • Details of operations: purchases, sales, client orders...
• Poorly adapted to managers (decision makers)
  • Little decisional information: turnover by period, turnover by category of products, consolidated data...
  • Dashboards pre-programmed for predefined needs

Observations

Data in OLTP systems:
• Are detailed
• Cannot be used directly by managers

Techniques:
• Design methodologies completely ignore decisional requirements
• DBMSs were not developed for such user requirements:
  • Normalization → problems: many joins, cost, crashes (too many joins), performance degradation (slowdown)

Observations

Finally, many researchers have claimed that there is an "overabundance of data and lack of information"; more precisely, there is a lack of knowledge. [J.M. Gouarné 1998]

Let's read the illustrative example (next slide).

Illustrative Example

Suppose your manager asks you about the travel expenses of the company's employees, and you meet him in his office and spill (1) a heterogeneous heap (2) of hotel bills, airline tickets and highway tickets, with amounts expressed in different currencies, some tax-free and others tax-included. You have provided data of undeniable accuracy, but it may not be the information your manager was expecting. [J.M. Gouarné 1998]

Does your manager have time to exploit this huge, detailed amount of data?

Did you correctly answer his query?

→ Decision makers need aggregated data, not detailed data.

(1) Spilling: giving. (2) Heap: mountain, pile.
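The travel-expenses example can be sketched in a few lines of Python: heterogeneous documents are converted to one currency and one tax convention, then aggregated per kind of expense. The exchange rates, the tax rate, the sample documents and all names below are hypothetical values chosen only for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical conversion rates to a single reporting currency, and a flat tax rate.
RATE_TO_SAR = {"SAR": Decimal("1"), "USD": Decimal("3.75"), "EUR": Decimal("4")}
TAX_RATE = Decimal("0.15")

# Raw documents as handed over: (kind, amount, currency, tax_included)
documents = [
    ("hotel", Decimal("200"), "USD", True),
    ("airline", Decimal("1000"), "SAR", False),
    ("highway", Decimal("10"), "EUR", True),
    ("hotel", Decimal("100"), "EUR", False),
]

def to_reporting_amount(amount, currency, tax_included):
    """Convert one document to a tax-included amount in the reporting currency."""
    converted = amount * RATE_TO_SAR[currency]
    return converted if tax_included else converted * (1 + TAX_RATE)

def summarize(docs):
    """Aggregate heterogeneous documents into one total per kind of expense."""
    totals = defaultdict(Decimal)
    for kind, amount, currency, tax_included in docs:
        totals[kind] += to_reporting_amount(amount, currency, tax_included)
    return dict(totals)

summary = summarize(documents)
for kind, total in summary.items():
    print(kind, total)
```

The manager receives three comparable totals instead of a pile of receipts; the detailed documents remain available if a figure needs to be justified.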

Conclusion

For many years, it has been commonly accepted that the OLTP system is unable to supply decision makers with the right information (i.e., decisional data) at the right time (i.e., when the decision has to be made).
→ New solutions should be found.

Solutions touch several levels:
• Modeling level
• Implementation level (logical and physical)
• Architectural level

New Solutions

Modeling level: specific models
"E/R data models [...] cannot be understood by users and they cannot be navigated usefully by DBMS software. E/R models cannot be used as the basis for enterprise data warehouses." (Kimball, 96)

Implementation level: optimization/manipulation
• Pre-calculated queries (materialized views)
• Parallel computing
• User-friendly interfaces dedicated to decision makers (ad hoc queries)
• OLAP algebra

Architectural level
• Data warehouse
• Data mart
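The idea of pre-calculated queries can be illustrated with Python's standard sqlite3 module. SQLite has no materialized views, so this sketch substitutes an ordinary summary table built once from the detail rows; the table names, columns and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (sale_date TEXT, category TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [
        ("2002-05-06", "books", 1200),
        ("2002-05-07", "books", 700),
        ("2002-05-07", "music", 500),
        ("2002-06-01", "music", 300),
    ],
)

# Pre-calculate the aggregate once; a plain table stands in for a materialized view.
conn.execute(
    """
    CREATE TABLE sales_by_month_category AS
    SELECT substr(sale_date, 1, 7) AS month, category, SUM(amount) AS turnover
    FROM sale
    GROUP BY substr(sale_date, 1, 7), category
    """
)

def turnover(conn, month, category):
    """Answer a decision maker's question from the summary table, not the detail rows."""
    row = conn.execute(
        "SELECT turnover FROM sales_by_month_category WHERE month = ? AND category = ?",
        (month, category),
    ).fetchone()
    return row[0] if row else 0

print(turnover(conn, "2002-05", "books"))
```

The trade-off is the usual one: queries read a small pre-aggregated table, but the summary has to be refreshed whenever new detail rows are loaded.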

Decision Support System (DSS)

A Decision Support System is an information system intended for predictive management (measure, evaluate, know, foresee).

[Figure: DSS data flow.]
• INPUT (available systems): internal databases (transactional), Web, partners' base
• STORAGE: Data Warehouse
• OUTPUT: analyses, manipulations, queries, data mining

OLTP Vs OLAP

Criterion | OLTP | OLAP
Objective | Transactional | Decisional
Users | Many, concurrent | Small number
Manipulations | Destructive updates (INSERT, UPDATE and DELETE) | Incremental updates (INSERT)
Repetitiveness | The same task is executed many times | Single-use, ad hoc queries
Performance | Required | Less required
Response time | Instant (real time) | A few tens of seconds to 1 minute
Data structure | Normalized (Codd's theory) | Often aggregated
Historization | Generally not supported | Data is time-stamped
Tables | Too many tables; normalized → small size | Few tables; NF2 → redundant, big size

DB Vs DW

The Database (DB) is the core component of the operational information system (OLTP), whereas the Data Warehouse (DW) is the core component of modern Decision Support Systems (OLAP).

[Figure: relationship between OLTP and OLAP. The databases of the OLTP information system feed the Data Warehouse of the OLAP system through the ETL process.]

Data Warehouse Definition

"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." [Inmon 94]

http://www.1keydata.com/datawarehousing/data-warehouse-definition.html

Data Warehouse Definition

Subject-Oriented: A DW can be used to analyze a particular subject area.

Example: "Sales" is a subject.

[Figure: subject areas: Products, Branches, Clients, Suppliers, Sales, Purchases, Bills, ...]

Data Warehouse Definition

Integrated: A DW integrates data from multiple data sources.

Examples:
1) Different IDs: sources A and B may identify a product in different ways, but in a DW there is only a single way of identifying a product.
2) Formats, formalism, field names, ...

Problems:
• Data semantics: make the meaning explicit, resolve conflicts
• Codification: unify
• Unavailable data (nulls)
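The integration problems listed for this property (different identifiers, different field names and formats, unavailable data) can be sketched as follows. The two source extracts, the key mapping and the function names are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical source extracts: each source identifies and describes products differently.
source_a = [{"sku": "A-001", "name": "Laptop", "price": "1200"}]
source_b = [
    {"product_code": 77, "label": "LAPTOP ", "unit_price": 1150.0},
    {"product_code": 78, "label": "Mouse", "unit_price": None},
]

# Assumed mapping from (source, local identifier) to the single warehouse key.
KEY_MAP = {("A", "A-001"): 1, ("B", 77): 1, ("B", 78): 2}

def from_a(rec):
    """Translate a record of source A into the unified warehouse format."""
    return {
        "product_key": KEY_MAP[("A", rec["sku"])],
        "name": rec["name"].strip().title(),
        "price": float(rec["price"]),
        "source": "A",
    }

def from_b(rec):
    """Translate a record of source B into the unified warehouse format."""
    price = rec["unit_price"]
    return {
        "product_key": KEY_MAP[("B", rec["product_code"])],
        "name": rec["label"].strip().title(),
        "price": float(price) if price is not None else None,  # unavailable data stays null
        "source": "B",
    }

integrated = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
for row in integrated:
    print(row)
```

After integration, the laptop is identified by the same key and the same name whichever source it came from, and the missing price is kept as an explicit null rather than guessed.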

Data Warehouse Definition

Time-Variant: Historical data is kept in a DW.

One can retrieve data from 3 months, 6 months or 12 months ago, or even older data, from a DW. This contrasts with a transaction system, where often only the most recent data is kept.

Example: a transaction system may hold only the most recent address of a customer, whereas a DW can hold all addresses associated with a customer.

Note that DW data is time- and geo-referenced:
• DW data is recorded so that it can later be analyzed over time, but also by geographical coordinates.
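The customer-address example can be sketched with two small classes: one overwrites the address as a transaction system would, the other appends time-stamped rows as a warehouse would. The class names, method names and sample data are assumptions made for this illustration.

```python
from datetime import date

class TransactionalCustomer:
    """Operational view: only the most recent address is kept."""

    def __init__(self, address):
        self.address = address

    def move(self, new_address):
        self.address = new_address  # destructive update: the old value is lost

class WarehouseCustomerHistory:
    """Warehouse view: every address is kept with the date from which it applies."""

    def __init__(self):
        self.rows = []  # append-only list of (valid_from, address)

    def load(self, valid_from, address):
        self.rows.append((valid_from, address))

    def address_on(self, day):
        """Return the address valid on `day`, or None if no row applies yet."""
        valid = [(d, a) for d, a in self.rows if d <= day]
        return max(valid)[1] if valid else None

oltp = TransactionalCustomer("Jeddah")
oltp.move("Riyadh")

dw = WarehouseCustomerHistory()
dw.load(date(2001, 1, 1), "Jeddah")
dw.load(date(2002, 3, 1), "Riyadh")

print(oltp.address, dw.address_on(date(2001, 6, 30)))
```

Only the warehouse version can answer a question such as "where did this customer live in mid-2001?", because it never overwrites a loaded row.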

Data Warehouse Definition

Non-volatile: Once data is loaded into the DW, it does not change. Historical data in a DW should therefore never be altered.

A data warehouse is a copy of transaction data specifically structured for query and analysis.

[Figure: data manipulation operations in a transaction system and in a DW. Source (transaction system): INSERT, UPDATE, DELETE, SELECT. DW: Loading, SELECT.]

Data Warehouse Definition

"A DSS is built around a corporate memory: the dynamics of the organization appear in the data itself and not in the data-processing procedures." [J. M. Franco]

A DSS is considered a horizontal repository for the organization.

Data Warehouse Definition

"support of management's decision making process" = Aggregate

Sales (detailed data)
Date | Quantity | Unit Price
06/05/2002 | 10 | 120
07/05/2002 | 5 | 140
08/05/2002 | 12 | 100
09/05/2002 | 12 | 150
10/05/2002 | 2 | 250
11/05/2002 | 3 | 50
12/05/2002 | 15 | 80
13/05/2002 | 11 | 120
14/05/2002 | 8 | 135
15/05/2002 | 10 | 100
16/05/2002 | 10 | 160
17/05/2002 | 3 | 250
18/05/2002 | 5 | 45
19/05/2002 | 30 | 60

Sales (aggregated data)
Date | Amount
Week 18 | 6750
Week 19 | 7775
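The aggregation shown on this slide can be reproduced with a short Python sketch: each daily row contributes quantity × unit price to the total of its week. The week numbering used here (weeks start on Monday, as returned by `strftime("%W")`) is an assumption chosen because it matches the slide's "Week 18" and "Week 19" labels.

```python
from collections import defaultdict
from datetime import datetime

# Detailed sales rows from the slide: (date as dd/mm/yyyy, quantity, unit price)
SALES = [
    ("06/05/2002", 10, 120), ("07/05/2002", 5, 140), ("08/05/2002", 12, 100),
    ("09/05/2002", 12, 150), ("10/05/2002", 2, 250), ("11/05/2002", 3, 50),
    ("12/05/2002", 15, 80), ("13/05/2002", 11, 120), ("14/05/2002", 8, 135),
    ("15/05/2002", 10, 100), ("16/05/2002", 10, 160), ("17/05/2002", 3, 250),
    ("18/05/2002", 5, 45), ("19/05/2002", 30, 60),
]

def weekly_amounts(rows):
    """Sum quantity * unit price per week (Monday-based week number, strftime '%W')."""
    totals = defaultdict(int)
    for day, quantity, unit_price in rows:
        week = int(datetime.strptime(day, "%d/%m/%Y").strftime("%W"))
        totals[week] += quantity * unit_price
    return dict(totals)

print(weekly_amounts(SALES))  # {18: 6750, 19: 7775}
```

Fourteen detailed rows collapse into the two figures a decision maker actually reads.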

DW & Data Marts: Two different philosophies

In the data warehousing field, we often hear discussions about whether a person's or organization's philosophy falls into Bill Inmon's camp or Ralph Kimball's camp. The difference between the two philosophies is described below.

Bill Inmon's paradigm: The DW is one part of the overall BI system. An enterprise has one DW, and data marts source their information from the DW. In the DW, information is stored in 3NF.

Ralph Kimball's paradigm: The DW is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

Neither idea is right or wrong; they represent different data warehousing philosophies. In reality, the DW in most enterprises is closer to Ralph Kimball's idea, because most DWs started out as a departmental effort and hence originated as a data mart. Only when more data marts are built later do they evolve into a DW.
http://www.1keydata.com/datawarehousing/inmon-kimball.html

Data Warehouse Architecture

→ Two storage spaces: 1 central DW, n DMs
• DSS = DW + { DM } or DSS = { DM }

Inmon Vs Kimball Architecture: The Benefits

Inmon's Architecture

Storage spaces
1) Data warehouse:
• Gathers all pertinent data for all managers
• Loaded from operational sources
• Modeled as a conventional DB

2) Data marts:
• Extracted from the DW
• Specific to a group of managers
• Modeled according to a model appropriate for analytical (OLAP) software tools

Principle of DSS

Need for a specific decisional data model
Remember: decision makers look for indicators describing their business activities. These indicators are determined by correlations/consolidations of data sets, and they should be independent of the operational procedures of their company.

→ The way data is perceived should be completely independent of the data structures and procedures of the transactional system.

In addition, the following criteria are not required:
• Transactional performance
• Referential integrity
• Data normalization (Codd's normal forms)

Need for a specific Decisional Data Model

Example: sales analyses.

First, have a glance at the E/R diagram on the next slide.

Get the amount of sales for a specific category of products, made to customers in a certain geographical area, and delivered from stores at a given geographical site.

Based on the E/R model on the next slide, such queries are very hard to express, however simple the query language used (e.g., SQL)!
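To give a feel for why such a query is hard to express on a normalized model, here is a sketch using Python's standard sqlite3 module. The schema is a simplified, hypothetical subset that borrows a few entity names from the sample E/R diagram (Category, Variety, Product, Client, Site, Mart, Bill, Delivery); the attributes, the relationships between tables and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE variety  (id INTEGER PRIMARY KEY, category_id INTEGER REFERENCES category(id));
CREATE TABLE product  (id INTEGER PRIMARY KEY, variety_id INTEGER REFERENCES variety(id));
CREATE TABLE client   (id INTEGER PRIMARY KEY, area TEXT);
CREATE TABLE site     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE mart     (id INTEGER PRIMARY KEY, site_id INTEGER REFERENCES site(id));
CREATE TABLE bill     (id INTEGER PRIMARY KEY, client_id INTEGER REFERENCES client(id),
                       product_id INTEGER REFERENCES product(id), amount INTEGER);
CREATE TABLE delivery (bill_id INTEGER REFERENCES bill(id), mart_id INTEGER REFERENCES mart(id));

INSERT INTO category VALUES (1, 'Dairy'), (2, 'Bakery');
INSERT INTO variety  VALUES (1, 1), (2, 2);
INSERT INTO product  VALUES (1, 1), (2, 2);
INSERT INTO client   VALUES (1, 'West'), (2, 'East');
INSERT INTO site     VALUES (1, 'Jeddah'), (2, 'Riyadh');
INSERT INTO mart     VALUES (1, 1), (2, 2);
INSERT INTO bill     VALUES (1, 1, 1, 100), (2, 1, 1, 50), (3, 2, 1, 70), (4, 1, 2, 30);
INSERT INTO delivery VALUES (1, 1), (2, 2), (3, 1), (4, 1);
""")

# One business question, seven joins across the normalized tables.
QUERY = """
SELECT COALESCE(SUM(b.amount), 0)
FROM bill b
JOIN product  p  ON p.id = b.product_id
JOIN variety  v  ON v.id = p.variety_id
JOIN category c  ON c.id = v.category_id
JOIN client   cl ON cl.id = b.client_id
JOIN delivery d  ON d.bill_id = b.id
JOIN mart     m  ON m.id = d.mart_id
JOIN site     s  ON s.id = m.site_id
WHERE c.name = ? AND cl.area = ? AND s.name = ?
"""

def sales_amount(conn, category, client_area, delivery_site):
    """Total billed amount for one category, client area and delivery site."""
    return conn.execute(QUERY, (category, client_area, delivery_site)).fetchone()[0]

print(sales_amount(conn, "Dairy", "West", "Jeddah"))
```

Even on this reduced schema, the query has to navigate eight tables to answer a single question, which illustrates the case for a model designed for decisional queries.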

Sample E/R diagram for an OLTP system

[Figure: E/R diagram with the entities Category, Variety, Product, Delivery, Agreement, Agreement Type, Agency, Site, Mart, Factory, Vendor, Client, Bill.]