30
Introduction to Data Warehouse and Data Mining MIS 2502 Data Analytics

Introduction to Data Warehouse and Data Mining MIS 2502 Data Analytics

Embed Size (px)

DESCRIPTION

Introduction to Data Warehouse and Data Mining MIS 2502 Data Analytics. Importance of data. What organizations do with data? Transaction Processing (E)commerce: Amazon.com; PNC Bank B2B systems: Supply Chain Management Web Search Decision Making Financial reporting Inventory management - PowerPoint PPT Presentation

Citation preview

1

Introduction to Data Warehouse and Data Mining

MIS 2502Data Analytics

2

Importance of data

What organizations do with data? Transaction Processing

– (E)commerce: Amazon.com; PNC Bank– B2B systems: Supply Chain Management– Web Search

Decision Making– Financial reporting– Inventory management– Budget allocation– Customer Relationship Management– Target Marketing– Product Design and Promotions– Fraud Detection

3

Unnormalized data set

Patient ID

Name Address DOB Doc ApptDate

Location DX

111111 Cindy Marselis

2320 Edge Hill Road

1/11/64 Armstrong 9/1/09 11:00 AM

Alter 2011 Herniated DiscFlu

111111 Cindy Marselis

9331 Rising Sun Avenue

1/11/64 Morningstar 9/1/09 11:00 AM

Alter 2011 Herniated Disc

111111 Cindy Marselis

2320 Edge Hill Road

1/11/64 Allen 11/1/09 10:00 AM

Alter 2012 Psoriasis

222222 Kathryn Marselis

2320 Edge Hill Road

11/3/04 Dershaw 8/1/09 11:00 AM

Speakman 105

Well baby check

111111 Cindy Schwartz

9331 Rising Sun Avenue

1/11/64 Armstrong 8/11/09 3:00 PM

Alter 105 PsoriasisHerniated Disc

4

Normalized db - before

5

Normalized db - after

6

Decision Making with Databases

Databases are used for transaction processing

Data from transaction processing is used for tactical decision making– Database provides basic reporting function

But…

7

Different managers require different data and data may come from other part of the organization or outside the organization

External and internal forces require tactical and strategic decisions

Search for competitive advantage Business environments are dynamic Decision-making cycle time is reduced

The Need for Data Analysis

8

Some Questions Analysts Need to Answers

Sales analysis:What are the sales by quarter and geography?How do sales compare in two different stores in the same state?Profitability analysis:Which is the most profitable store in the state CA? Which product lines are the highest revenue producers this year?Which products and product lines are the most profitable this quarter?Sale force analysisWhich salesperson is the best revenue producer this year?Do salesperson X meet his sale target this quarter?

9

From transaction processing to supporting decision making

10

Operational data – Relational, normalized database – Optimized to support transactions – Real time updates

DSS – Snapshot of operational data– Summarized – Large amounts of data

Data analyst viewpoint– Timespan– Granularity– Dimensionality

Operational vs. Decision Support Data

11

Integrated– Centralized– Holds data retrieved from entire organization

Subject-Oriented – Optimized to give answers to diverse questions– Used by all functional areas

Time Variant – Flow of data through time– Projected data

Non-Volatile – Data never removed– Always growing

Data Warehouse

12

Data Warehouse

Extraction, transformation, and loading (ETL) – a process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse

Data mart – contains a subset of data warehouse information

13

ETL – Extraction, Transformation, Load

Transform: cleanse data for consistency and output exceptionso Apply business ruleso Selecting certain columns to load

(not null records)o Translating coded values (1, M, male

= 0 )o Derive new calculated value

(sale_amount = qty * unit_price)o Join data from multiple sources

(lookup, merge)o Aggregate (rollup/summarize data)o Transpose/pivot (turning columns

into rows)o Data validation.

Extract

data from source systems

Load:

data into repository

14

Data in a Data Warehouse

Data for a data warehouse is obtained from a variety of databases

– E.g. customer database, transaction database, accounts database

Data in data warehouse is multidimensional

15

Multidimensional Analysis

Cube – common term for the representation of multidimensional information

16

Data-modeling technique Maps multidimensional decision support into relational

database Yield model for multidimensional data analysis while

preserving relational structure of operational DB Four Components:

– Facts– Dimensions– Attributes– Attribute hierarchies

Star Schema

17

Simple Star Schema

Figure 13.12

18

Slice and Dice View of Sales

Figure 13.14

19

Facts and dimensions represented by physical tables in data warehouse DB

Fact table related to each dimension table (M:1) Fact and dimension tables related by foreign keys Subject to the primary/foreign key constraints

Star Schema Representation

20

Star Schema for Sales

Sales fact table and its four dimensions: location, time, product, and customer .Allows sales to be aggregated by time, geographic location, product, and by customer.

21

Data Warehouse to Data Marts

Given the large size of a data warehouse, organizations create data marts

– Subject oriented data

– Subset of data in a data warehouse

– Used for focused decision-making

22

Advanced data analysis environment Supports decision making, business modeling, and

operations research activities Characteristics of OLAP

– Use multidimensional data analysis techniques– Provide advanced database support– Provide easy-to-use end-user interfaces– Support client/server architecture

Online Analytical Processing (OLAP)

23

OLAP Client/Server Architecture

Figure 13.6

24

Seeks to discover patterns or relationships within the data

Data mining tools automatically search data for patterns and relationships

Data mining tools– Analyze data– Uncover problems or opportunities– Form computer models based on findings– Predict business behavior with models– Require minimal end-user intervention

Data Mining

25

What Are Data-Mining Tools?

26

Data Mining Process

27AB113 - Information Technology

Business Intelligence

28

MS SQL 2008 Architecture

Dimensional Model/Star SchemaRelational Model

29

Back Room—Data prepared from many sources Front Room—Information presented

30

Multidimensional Analysis and Data Mining

Differences between databases and data warehouse/data mart?

Data mining – the process of analyzing data to extract information not offered by the raw data alone– To perform data mining users need data-mining tools

» Data-mining tool – uses a variety of techniques to find patterns and relationships in large volumes of information and infers rules that predict future behavior and guide decision making

Business intelligence – taking data from multiple sources and turn it into useful and easy to understand information to support decision-making efforts for various kinds of people.