62
Data Warehousing & Mining Techniques & Mining Techniques Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

Embed Size (px)

Citation preview

Page 1: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

Data Warehousing & Mining Techniques& Mining TechniquesWolf-Tilo BalkeSilviu HomoceanuInstitut für InformationssystemeTechnische Universität Braunschweighttp://www.ifis.cs.tu-bs.de

Page 2: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Lecture– 02. 04.2009 –11.07.2009

– 10:30-13:00 (3 lecture hours !)

– Exercises, detours, and home work discussion integrated into lecture

0. Organizational Issues

integrated into lecture

• 4 Credits

• Exams– Oral exam

– 50% of exercise points needed

to be eligible for the exam

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

Page 3: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Bad decisions can lead to disaster

– Data Warehousing is at

the base of decision

support systems

0. Why should you be here?

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3

Page 4: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Data Warehousing & OLAP is important

• It helps to

– Understand the information hidden within

the organization’s data

0. Why should you be here?

the organization’s data

• See data from different angles: product, client, time, geographical area

• Get adequate statistics to get your point of argumentation across

• Get a glimpse of the future…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4

Page 5: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• And because you love databases…

0. Why should you be here?

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 5

Page 6: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Building the Data Warehouse

– William H. Inmon

– Wiley, ISBN 0-7645-9944-5

• The Data Warehouse Toolkit

0. Recommended Literature

• The Data Warehouse Toolkit

– Ralph Kimball & Margy Ross

– Wiley, ISBN 0-471-20024-7

• Data Warehouse-Systeme

– Andreas Bauer & Holger Günzel

– dpunkt.verlag, ISBN 3-89864-251-8

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6

Page 7: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• The Data Warehouse ETL Toolkit

– Ralph Kimball & Joe Caserta

– Wiley, ISBN 0-7645-6757-8

• OLAP Solutions

0. Recommended Literature

• OLAP Solutions

– Erik Thomsen

– Wiley, ISBN 0-471-40030-0

• Data Warehouses and OLAP

– Robert Wrembel & Christian Koncilia

– IRM Press, ISBN 1-59904364-5

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 7

Page 8: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

1 Introduction

1.1 What is a data warehouse?

1.2 Applications and users

1.3 Lifecycle / phases of a data

1. Introduction

1.3 Lifecycle / phases of a data warehouse

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8

Page 9: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Basically a very large database…

– Not all very large databases are data warehouses,

but all data warehouses are pretty large databases

– Nowadays a warehouse is considered

I.I What is a data warehouse?

Nowadays a warehouse is considered

to start at around 800 GB and goes

up to several TB

– It spans over several servers and needs an impressive amount of computing power

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 9

Page 10: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• More specific, a collective data repository

– Containing snapshots of the operational data (history)

– Obtained through data cleansing (Extract-Transform-Load)

I.I What is a data warehouse?

– Useful for analytics

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10

Page 11: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Compared to other solutions it…

– Is suitable for tactical/strategic focus

– Implies a small number of transactions

– Implies large transactions spanning over a long

I.I What is a data warehouse?

Implies large transactions spanning over a long period of time

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 11

Page 12: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Experts say…

– Ralph Kimball: “a copy of transaction data specifically structured for query and analysis”

– Bill Inmon: “A data warehouse is a:

I.I Some Definitions

– Bill Inmon: “A data warehouse is a:– Subject oriented

– Integrated

– Non-volatile

– Time variant

collection of data in support of management’s decisions.”

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12

Page 13: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Subject oriented

– The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together

• Typical subject areas in DWs are

I.I Inmon Definition

• Typical subject areas in DWs are Customer, Product, Order, Claim, Account,…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 13

Page 14: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Subject oriented

– Example: customer as subject in a DW

• DW is organized in this case by the customer

• It may consist of 10, 100 or more physical tables, all related

I.I Inmon Definition

CUSTOMERall related

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14

Base customer

Data 2000 - 2002

Base customer

Data 2003 - 2005Customer activity

2001 - 2004

Customer activity

Detail 2002 - 2004

Customer activity

Detail 2005 - 2006

CUSTOMER

Page 15: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Integrated

– The data warehouse contains data from most or all of an organization's operational systems and this data is made consistent

– E.g. gender, measurement, conflicting keys,

I.I Inmon Definition

– E.g. gender, measurement, conflicting keys, consistency,…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 15

appl A – M,F

operational encoding DW

M,F

appl B – 1,0

appl C – male,female

appl A – cm

appl B – inches

cm

Page 16: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Non-volatile– Data in the data warehouse is never over-written or deleted - once committed, the data is static, read-only, and retained for future reporting

– Data is loaded, but not updated

I.I Inmon Definition

– Data is loaded, but not updated

– When subsequent changes occur, a new snapshot record is written

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16

Record-by-record

manipulation

insert

delete

change

access

access

Operational

Mass load/

access of data

load

DW

access

Page 17: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Time-varying

– The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time

– Different environments have different time horizons

I.I Inmon Definition

– Different environments have different time horizons associated

• While for operational systems a 60-to-90 day time horizon is normal, data warehouse has a 5-to-10 year horizon

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 17

Page 18: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• More general, a DW is a

– Repository of an organization’s electronically stored data

– Designed to facilitate reporting and analysis

I.I General Definition

reporting and analysis

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18

Page 19: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• DW typically…

– Reside on computers dedicated to this function

– Run on DBMS such as Oracle, IBM DB2, Teradataor Microsoft SQL Server

I.I Typical Features

– Retain data for long periods of time

– Consolidate data obtained from a variety of sources

– Are built around their own carefully designed data model

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 19

Page 20: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• DW stands for big data volume, so lets take an example of 2 big companies, a retailer, say Walmartand a RDBMS vendor, Sybase:– Walmart CIO: I want to keep track of sales in

all my stores simultaneously

1.1 Use Case

all my stores simultaneously

– Sybase consultant: You need our wonderful RDBMS software. You can stuff data in as sales are rung up at cash registers and simultaneously query data right in your office

– So Walmart buys a $1 milion Sun E10000 multi-CPU server, a $500 000 Sybase license, a book

“Database Design for Smarties”, and build

themselves a normalized SQL data model

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20

Page 21: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• After a few months of stuffing data into the tables… a Walmart executive asks…– I have noticed that there was a Colgate promotion recently, directed to people who live in small towns. How much toothpaste did we sell in those towns yesterday?

– Translation to a query:

1.1 Use Case

– Translation to a query:• select sum(sales.quantity_sold) from sales, products, product_categories, manufacturers,

stores, cities where manufacturer_name = ‘Colgate’ and product_category_name = ‘toothpaste’and cities.population < 40 000and trunc(sales.date_time_of_sale) = trunc(sysdate-1)and sales.product_id = products.product_idand sales.store_id = stores.store_idand products.product_category_id = product_categories.product_category_idand products.manufacturer_id = manufacturers.manufacturer_idand stores.city_id = cities.city_id

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 21

Page 22: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– The tables contain large volumes of data and

the query implies a 6 way join so it will take

some time to execute

– The tables are at the same time also updated by new sales

1.1 Use Case

sales

– Soon after executive start their quest for marketing information store employees notice that there are times during the day when it is impossible to process a sale

Any attempt to update the database results in freezing the computer up for 20 minutes

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22

Page 23: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• In minutes…the Walmart CIO calls Sybase tech support

• Walmart CIO: WE TYPE IN THE TOOTHPASTE QUERY AND OUR SYSTEM HANGS!!!

• Sybase support: Of course it does! You built an on-line transaction processing (OLTP) system. You can’t feed it a

1.1 Use Case

transaction processing (OLTP) system. You can’t feed it a decision support system (DSS) query and expect things to work!

– Walmart CIO: !@%$#. I thought this was the whole point of SQL and your RDBMS…to query and insert simultaneously!!

– Sybase support: Uh, not exactly. If you’re reading from the database, nobody can write to the database. If you’re writing to the database, nobody can read from the database. So if you’ve got a query that takes 20 minutes to run and don’t specify special locking instructions, nobody can update those tables for 20 minutes.

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 23

Page 24: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Walmart CIO: It sounds like a bug.

– Sybase support: Actually it is a feature. We call it pessimistic locking.

– Walmart CIO: Can you fix your system so that it doesn’t lock up???

– Sybase support: No. But we made this great loader tool so that you can copy

everything from your OLTP system into a separate Data Warehouse system

at 100 GB/hour

1.1 Use Case

• After a while…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24

Page 25: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• OLTP (OnLine Transaction Processing):

– Also known under the name of operational data, it represents day-to-day operational business activities:

• Purchasing, sales, production distribution, …

– Typically for data entry and retrieval transaction

I.I OLTP vs. DW

– Typically for data entry and retrieval transaction processing

– Reflects only the current state of the data

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 25

Page 26: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• OLAP (OnLine Analytical Processing):

– Represents front-end analytics based on a DW repository

– It provides information for activities like

Resource planning, capital budgeting, marketing initiatives,…

I.I OLTP vs. DW

• Resource planning, capital budgeting, marketing initiatives,…

– It is decision oriented

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26

Page 27: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Properties

I.I OLTP vs. DW

Operational DB DW

Mostly updates Mostly reads

Many small transactions Queries long, complex

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 27

MB-TB of data GB-TB of data

Raw data Summarized data

Clerical users Decision makers

Up-to-date data May be slightly outdated

Page 28: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Consider a normalized database for a store

– The tables would look like this…

I.I OLTP vs. DW

Customer

Customer_ID (PK)

Name

Invoice

Invoice_number (PK)

Date

Customer_ID (FK)

Invoice_line_item

Invoice_number (FK)

Item_seq_number

Product_ID (FK)

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28

Address

City

Postal_code

Customer_ID (FK)

Status_code (FK)

Total

Sales_tax

Shipping_charge

Status

Status_code (PK)

Status

Sales_unit

Sales_unit_D (PK)

Name

Address

City

Postal_code

Telephone_number

Email

Product

Product_ID (PK)

Name

Description

Cost

Sales_unit_ID (FK)

Product_ID (FK)

Units

Unit_cost

Page 29: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• If we were to set up a DW for that store, we would start by building the following star schema

I.I OLTP vs. DW

Customer

Customer_ID (PK)

Name

Address

Product

Product_ID (PK)

Name

Description

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 29

Address

City

Postal_codeSales_facts

Product_ID (FK)

Customer_ID (FK)

Time_key (FK)

Sales_unit_ID (FK)

Total

Sales_tax

Shipping_charge

Sales_unit

Sales_unit_ID (PK)

Name

Address

City

Postal_code

Telephone

Time_period

Time_key (PK)

Description

Day

Fiscal_week

Fiscal_period

Fiscal_year

Description

Cost

Page 30: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Basic insights from comparing OLTP and DWs

– A DW is a separate (RDBMS) installation that contains copies of data from on-line systems

• Physically separate hardware may not be absolutely necessary if you have lots of extra computing power,

I.I OLTP vs. DW

necessary if you have lots of extra computing power, but it is recommended

– With an optimistic locking DBMS you might even be able to get away for a while with keeping just one copy of your data

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30

Page 31: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• There is an essentially different pattern of hardware utilization between on-line and analytical processing

I.I OLTP vs. DW

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 31

Operational Data warehouse

Page 32: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Typical questions which can be answered with DW & OLAP– How much did sales unit A earn in January?

– How much did sales unit B earn in February?

– What was their combined sales amount for the first

I.2 Applications of DW

– What was their combined sales amount for the first quarter?

• Answering these questions with SQL-queries is difficult – Complex query formulation necessary

– Process is likely to be slow due to complex joins and multiple scans

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32

Page 33: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Why such questions can be answered better with a DW?

– Because in a DW tables are rearranged and pre-aggregated (known as computing cubes)

• The tables arrangement is subject oriented, usually some

I.2 Applications of DW

• The tables arrangement is subject oriented, usually some star schema

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 33

Page 34: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• A DW is the base repository for front-end analytics

– OLAP

– KDD (Knowledge Discovery in Databases) a data mining process

I.2 Applications of DW

K D Da data mining process

– Data visualization

– Reporting

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34

Page 35: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• OLAP is a form of information processing and thus needs to provide timely, accurate and understandable information

– timely is however a relative term:

I.2 Applications of DW

• In OLTP we expect an update to go through in a matter of seconds

• In OLAP the time to answer a query can take minutes, hours or even longer

• There are many flavors of OLAP

– ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 35

Page 36: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• KDD (Data Mining)

– Constructs models of the data in question

• Models can be viewed as high level summaries of the underlying data

I.2 Applications of DW

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36

ID Name ZIP Sex Age Income Children Car Spent

12 Peter 38106 M 35 € 55,000 2 Mini Van € 210.00

15 Gabriel 38100 M 32 € 56,000 0 SUV € 30.00

… … … … … … … … …

122 Claire 38106 F 21 € 42,000 0 Coupe € 50.00

Page 37: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Based on this example a query returns the data that fulfills the constraints

• SELECT * FROM CUSTOMER_TABLE WHERE TOTAL_SPENT > €100;

– Data mining might return the following set of rules

I.2 Applications of DW

– Data mining might return the following set of rules for customers spending more than €100:

• IF AGE > 35 AND CAR = ‘MINIVAN’ THEN TOTAL SPENT > €100

• IF SEX = ‘M’ AND ZIP = 38106 THEN TOTAL SPENT > €100

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 37

Page 38: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– It answers questions like

• Which products or customers are more profitable

• Which outlets have sold the least this year

– In consequence it motivates decisions like

• Which products should have their production increased

I.2 Applications of DW

• Which products should have their production increased

• Which customers should be targeted for special promotions

• Which outlets should be closed

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38

Page 39: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Users of DW are called DSS analysts and usually are business persons

– Their primary job is to define and discoverinformation used in corporate decision-making

– The way they think

I.2 Who is the user?

– The way they think

• “Give me what I say I want, and then I can tell you what I really want”

• They work in explorative manner

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 39

Page 40: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Typical explorative line of work

• “Ah! Now that I see what the possibilities are, I can tell what I really want to see. But until I know what the possibilities are, I cannot describe exactly what I want…”

– This usage has profound effect on the way a DW is

I.2 Who is the user?

– This usage has profound effect on the way a DW is developed

• The classical system development life cycle assumes that the requirements are known at the start of design

• The DSS analyst starts with existing requirements, but factoring in new requirements is almost impossible

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40

Page 41: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• DW System Development Life Cycle (SDLC)

– Design

• End-user interview cycles

• Source system cataloging

• Definition of key performance indicators

I.3 Lifecycle of DW

• Definition of key performance indicators

• Mapping of decision-making processes underlying information needs

• Logical and physical schema design

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 41

Page 42: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Prototype

• Objective is to constrain and in some cases reframe end-user requirements

– Deployment

• Development of documentation

I.3 Lifecycle of DW

• Development of documentation

• Training

• Operations and management processes

– Operation

• Day-to-day maintenance of the DW needs a good management of ongoing Extraction, Transformation and Loading (ETL) process

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42

Page 43: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Enhancement needs the modification of

• HW - physical components

• Operations and management processes

• Logical schema designs

I.3 Lifecycle of DW

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 43

4

3

1

2

Design

Prototype

Deploy

Operate

Enhance

Page 44: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Classical SDLC vs. DW SDLC

I.3 Lifecycle of DW

Requirements DW

Program

– DW SDLC is almost the opposite of classical SDLC

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44

Program

Requirements

Page 45: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Classical SDLC vs. DW SDLC

I.3 Lifecycle of DW

Classical SDLC DW SDLC

Requirements gathering Implement warehouse

Analysis Integrate data

Design Test for bias

– Because it is the opposite of SDLC, DW SDLC is also called CLDS

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 45

Design Test for bias

Programming Program against data

Testing Design DSS system

Integration Analyze results

Implementation Understand requirements

Page 46: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• CLDS is a data driven development life cycle

– It starts with data

• Once data is at hand it is integrated and tested against bias

• Programs are written against the data and the results are analyzed and finally the requirements of the system are

I.3 Lifecycle of DW

analyzed and finally the requirements of the system are understood

• Once requirements are understood,

adjustments are made to the design and

the cycle starts all over

– “spiral development methodology”

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46

Page 47: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• In Operating a DW the following phases can be identified

– Monitoring

– Extraction

I.3 Operating a DW

– Transforming

– Loading

– Analyzing

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 47

Page 48: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Monitoring

– Surveillance of the data sources

– Identification of data modification which is relevant to the DW

I.3 Monitoring

– Monitoring has an important role over the whole process deciding on which data the next steps will be applied on

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48

Page 49: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Monitoring techniques– Active mechanisms - Event Condition Action (ECA) rules:

– Replication mechanisms

I.3 Monitoring

EVENT Payment

CONDITION Account sum > 10 000 €

ACTION Transfer to economy account

– Replication mechanisms• Snapshot:

– Local copy of data, similar to a View

– Used by Oracle 9i

• Data replication– Replicates and maintains data in destination tables through data

propagation processes

– Used by IBM

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 49

Page 50: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Protocol based mechanisms

• Since DBMS write protocol data for transaction management, the protocol can be used also for monitoring

• Difficult due to the fact that the protocol format is proprietary and subject to change

I.3 Monitoring

proprietary and subject to change

– Application managed mechanisms

• Hard to implement for legacy systems

• Based on time stamping or data comparison

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50

Page 51: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Extraction

– Reads the data which was selected throughout the monitoring phase and inserts it in the data structures of the workplace

– Due to large data volume, compression can be used

I.3 Extraction

– Due to large data volume, compression can be used

– The time-point for performing extraction can be:

• Periodical:

– Weather or stock market information can be actualized more times in a day, while product specification can be actualized in a longer period of time

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51

Page 52: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• On request:

– For example when a new item is added to a product group

• Event driven:

– Event driven extraction can be helpful in scenarios where time, or the number of modifications over passing a specified threshold triggers the extraction. For example each night at 03:00 or each

I.3 Extraction

triggers the extraction. For example each night at 03:00 or each time 50 new modifications took place, an extraction is performed

• Immediate:

– In some special cases like the stock market it can be necessary that the changes propagate immediately to the warehouse

– The extraction largely depends on hardware and the software used for the DW and the data source

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52

Page 53: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Transforming

– Implies adapting data, schema as well as data quality to the application requirements

– Data integration:

Transformation in de-normalized data structures

I.3 Transforming

• Transformation in de-normalized data structures

• Handling of key attributes

• Adaptation of different types of the same data

• Conversion of encoding:

– “Buy”, “Sell” 1,2 vs. B, S 1,2

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 53

Page 54: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Normalization:– “Michael Schumacher” “Michael, Schumacher” vs.

“Schumacher Michael” “Michael, Schumacher”

• Date handling:– “MM-DD-YYYY” “MM.DD.YYYY”

• Measurement units and scaling:

I.3 Transforming

• Measurement units and scaling:– 10 inch 25,4 cm

– 30 mph 48,279 km/h

• Save calculated values– Price_incl_VAT = Price_excl_VAT * 1.19

• Aggregation– Daily sums can be added into weekly ones

– Different levels of granularity can be used

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54

Page 55: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Data cleaning:

• Consistency check

– Delivery_date < Order_date

• Completeness

– Management of missing values as well as NULL values

I.3 Transforming

– Management of missing values as well as NULL values

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 55

Page 56: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Loading– Loading usually takes place during weekends or nights

when the system is not under user stress

– Split between initial load to initialize the DW and the periodical load to keep the DW updated

I.3 Loading

periodical load to keep the DW updated

– Initial loading • Implies big volumes of data and

for this reason a bulk loader

is used

– Usually performed by partitioning, parallelization and incremental actualization

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56

Page 57: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Analyze– Data access

• Useful for extracting goal oriented information:– How many iPhones 3G were sold in the Braunschweig stores of T-

Mobile in the last 3 calendar weeks of 2008?

– Although it is a common OLTP query, it might be to complex for

I.3 Analyzing

– Although it is a common OLTP query, it might be to complex for the operational environment to handle

– OLAP• Falsely used as representing DW because it is used to

analyze data contained in DW

• Used to answer requests like: – In which district does a product group register the highest profit

– How did the profit change in comparison to the previous month?

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 57

Page 58: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

– Mostly known as organized on a multidimensional data model

– Common operations for analyze are:

» Pivoting/Rotation

» Roll-up, Drill-down and Drill-across

» Slice and Dice

– Data mining• Useful for identifying hidden patterns

I.3 Analyzing

• Useful for identifying hidden patterns

• Refers to two separate processes:– KDD (Knowledge Discovery in Databases)

– Prediction

• Useful for answering questions like:– How did the sales of this product group evolve?

• Methods and procedures for data mining– Clustering, Classification, Regression, Association rule learning

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58

Page 59: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• The concept of data warehousing dates back to the 1980s when IBM researchers Barry Devlin and Paul Murphy introduced the "business data warehouse“

• Why?

1.3 History of DW

• Why?– An enormous amount of redundancy of

information was required to support the multiple decision support environments that usually existed

– In larger corporations it was typical for multiple decision support environments to operate independently

– Each environment served different users but often required much of the same data

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 59

Page 60: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Based on analogies with real-life warehouses, DW were intended as large-scale

collection/storage/staging

areas for corporate data

1.3 History of DW

areas for corporate data

– Data could be retrieved from

one central point or data could be distributed to "retail stores" or “data marts" which were tailored for ready access by users.

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60

Page 61: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

1.3 History of DW

Early years of DW

1960s General Mills and Dartmouth College develop the terms dimensions and facts in a

joint research project

1970s ACNielsen provides dimensional data marts for retail sales

1983 Teradata introduces a database management system specifically designed for

decision support

1988 Barry Devlin and Paul Murphy publish the article “An architecture for a business and

information systems” in IBM Systems Journal where they introduce the term

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61

information systems” in IBM Systems Journal where they introduce the term

"business data warehouse"

1990 Red Brick Systems introduces Red Brick Warehouse, a database management system

specifically for data warehousing

1991 Prism Solutions introduces Prism Warehouse Manager, software for developing a

data warehouse

1991 Bill Inmon publishes the book Building the Data Warehouse

1995 The Data Warehousing Institute, a for-profit organization that promotes data

warehousing, is founded.

1996 Ralph Kimball publishes the book The Data Warehouse Toolkit

1997 Oracle 8, with support for star queries, is released

Page 62: Data Warehousing & Mining Techniques - TU Braunschweig ·  · 2009-04-02Data Warehousing & Mining Techniques Wolf-Tilo Balke ... • Building the Data Warehouse – William H. Inmon

• Data Warehouse architecture

– Basic architectures

– Storage models

– Layers

Next lecture

Layers

– Middleware

Data Warehousing & Data Mining– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62