CHAPTER 5 BUSINESS INTELLIGENCE: DATA WAREHOUSING, DATA ACQUISITION, DATA MINING, BUSINESS ANALYTICS, AND VISUALIZATION 5-1


Data, Information, Knowledge


Data Collection, Problems, and Quality

Data collection can be done manually or by instruments and sensors.

Data collection methods include surveys (using questionnaires), observation (e.g., using video cameras), and soliciting information from experts (e.g., through interviews). In addition, sensors and scanners are used for automatic data collection.

Exercise: Suggest a reliable method of data collection for identifying customers' buying patterns.


Data problems: The major DSS data problems are summarized in the following table, along with some possible solutions.


Data quality determines the usefulness of data as well as the quality of the decisions based on them.

Data quality problems fall into the following four categories and dimensions:
  Contextual data quality
  Intrinsic data quality
  Accessibility data quality
  Representation data quality

Data quality is often neglected or casually handled; problems are exposed when data is summarized.

Data Integrity

One of the major issues of data quality (DQ) is data integrity.

Data integrity assures the accuracy and consistency of data.

Data integrity issues:
  Uniformity
  Version
  Completeness check
  Conformity check
  Genealogy or drill-down
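A few of the integrity checks listed above can be illustrated with a short sketch. This is only an illustration under stated assumptions: the record layout, the field names (`customer_id`, `country`, `amount`), the allowed country codes, and the amount limits are all hypothetical, and the uniformity check is interpreted here as a simple range check.

```python
# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"customer_id": 1, "country": "US", "amount": 120.0},
    {"customer_id": 2, "country": "usa", "amount": 75.5},   # nonconforming code
    {"customer_id": 3, "country": "CA", "amount": None},    # missing value
    {"customer_id": 4, "country": "MX", "amount": -5.0},    # outside the limits
]

REQUIRED_FIELDS = ("customer_id", "country", "amount")
ALLOWED_COUNTRIES = {"US", "CA", "MX"}   # assumed reference list
AMOUNT_LIMITS = (0.0, 10_000.0)          # assumed lower and upper limits


def completeness_errors(rows):
    """Return (row index, field) pairs where a required field is missing."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field in REQUIRED_FIELDS
        if row.get(field) is None
    ]


def conformity_errors(rows):
    """Return indexes of rows whose country code is not in the allowed set."""
    return [
        i for i, row in enumerate(rows)
        if row.get("country") not in ALLOWED_COUNTRIES
    ]


def uniformity_errors(rows):
    """Return indexes of rows whose amount is present but outside the limits."""
    low, high = AMOUNT_LIMITS
    return [
        i for i, row in enumerate(rows)
        if row.get("amount") is not None and not (low <= row["amount"] <= high)
    ]


print(completeness_errors(records))
print(conformity_errors(records))
print(uniformity_errors(records))
```

In practice such checks would typically run at data capture or during loading, so that problems surface before the data is summarized.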

Data Access and Integration

Recognize what to access.
Integrate disparate and heterogeneous databases to develop enterprise-wide systems.
XML is becoming the standard language for database integration and data transfer.

Database Management Systems

A DBMS is a software program for managing a database.
It manages data (i.e., update, delete, insert, sort, manipulate, and retrieve data).
It generates reports.
It provides better data security.
It combines with a modeling language for the construction of DSS.

Database Models

Relational
  Flat, two-dimensional tables with multiple access queries
  Simple for the user to learn, and easily expanded or altered
  Can be accessed in a number of formats not anticipated at the time of the initial design and development of the database
  Can support large amounts of data
Hierarchical
  Top down, like a tree
  Fields have only one “parent”; each “parent” can have multiple “children”
  Quick, and useful mainly in transaction processing
Network
  Relationships created through linked lists, using pointers
  “Children” can have multiple “parents”
  Can save storage space through the sharing of some items

Database Models (con.)

Object oriented
  Data analyzed at a conceptual level
  Inheritance, abstraction, encapsulation
Multimedia based
  Multiple data formats such as JPEG, GIF, bitmap, PNG, sound, video, and virtual reality
  Requires specific hardware for full feature availability
Document based
  Document storage and management
Intelligent
  Intelligent agents and artificial neural networks (ANN)
  Inference engines

Data Warehouse

© 2005 Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang

A data warehouse is a comprehensive database that supports all decision analysis required by an organization by providing summarized and detailed information.

It has access to all information relevant to the organization, which may come from many different sources, both internal and external.

Data Warehouse (con.)

Data extraction: get data from the sources.
Data cleaning: detect errors in the data and rectify them when possible.
Data transformation: convert data from host format to warehouse format and check integrity.
Load: sort, summarize, consolidate, compute views, and build indices and partitions.
Refresh: propagate the updates from the data sources to the warehouse.
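The extraction, cleaning, transformation, and load steps above can be sketched as a small pipeline. This is a minimal illustration, not a production design: the source rows, the field layout, and the function names (`extract`, `clean`, `transform`, `load`) are hypothetical, and the "warehouse" is just an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical source rows in "host" format: (date, region, amount as text).
SOURCE_ROWS = [
    ("2014-11-01", "north", "100.50"),
    ("2014-11-01", "NORTH", "20.00"),
    ("2014-11-02", "south", "n/a"),     # dirty value that cannot be rectified
    ("2014-11-02", "south", "35.25"),
]


def extract():
    """Extraction: get data from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    cleaned = []
    for date, region, amount in rows:
        try:
            float(amount)
        except ValueError:
            continue
        cleaned.append((date, region, amount))
    return cleaned


def transform(rows):
    """Transformation: convert to warehouse format (upper-case region, float amount)."""
    return [
        {"date": date, "region": region.upper(), "amount": float(amount)}
        for date, region, amount in rows
    ]


def load(rows):
    """Load: sort the rows and summarize the amount by (date, region)."""
    summary = defaultdict(float)
    for row in sorted(rows, key=lambda r: (r["date"], r["region"])):
        summary[(row["date"], row["region"])] += row["amount"]
    return dict(summary)


warehouse = load(transform(clean(extract())))
print(warehouse)
```

Keeping each stage as a separate function mirrors the list above and makes it easy to rerun the whole chain when the sources change.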

Data Warehouse Characteristics

Subject oriented
Data from both internal and external sources is presented
Scrubbed so that data from heterogeneous sources are standardized
Time-variant
Nonvolatile: read only
Not normalized; may be redundant
Metadata included

Characteristics of Data Warehouses: Subject Oriented

Organized around major subjects, such as product and sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision process.

Characteristics of Data Warehouses: Integrated

Constructed by integrating multiple, heterogeneous data sources.
Data cleaning and data integration techniques are applied to ensure consistency in:
  Naming conventions (e.g., LastName in DB1 and FamilyName in DB2 have the same meaning)
  Encoding structures (e.g., the attribute User_Id is a long int in DB1 and a string in DB2)
  Attribute measures (e.g., cm vs. inch)
  …
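The three kinds of inconsistency named above (naming, encoding, and units of measure) can be resolved by mapping each source into one common format. The sketch below is illustrative only: the rows, the `height_cm` and `height_inch` attributes, and the target field names are hypothetical, and the choice of string identifiers and centimeters for the target format is an assumption.

```python
# Hypothetical rows from two source databases with different conventions.
db1_rows = [{"User_Id": 101, "LastName": "Smith", "height_cm": 180.0}]
db2_rows = [{"User_Id": "102", "FamilyName": "Jones", "height_inch": 70.0}]

CM_PER_INCH = 2.54


def from_db1(row):
    """Map a DB1 row to the assumed warehouse format."""
    return {
        "user_id": str(row["User_Id"]),   # encoding: int converted to string
        "last_name": row["LastName"],
        "height_cm": row["height_cm"],
    }


def from_db2(row):
    """Map a DB2 row to the assumed warehouse format."""
    return {
        "user_id": row["User_Id"],        # already a string
        "last_name": row["FamilyName"],   # naming: FamilyName means LastName
        "height_cm": round(row["height_inch"] * CM_PER_INCH, 2),  # measure: inch to cm
    }


integrated = [from_db1(r) for r in db1_rows] + [from_db2(r) for r in db2_rows]
for row in integrated:
    print(row)
```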

Characteristics of Data Warehouses: Time Variant

Data warehouse data provide information from a historical perspective (e.g., the past 5-10 years).
Every data item in the data warehouse contains an element of time.

Characteristics of Data Warehouses: Nonvolatile

Operational updates of data do not occur in the data warehouse environment.
The warehouse does not require transaction processing, recovery, or concurrency control mechanisms.
It requires only two data-access operations: initial loading of data and querying.

Characteristics of Data Warehouses: Metadata Included

Metadata refers to data about data.
The primary purpose of metadata should be to provide context to the data; that is, enriching information leading to knowledge.
Metadata plays a vital role in explaining how, why, and where data can be found, retrieved, stored, and used efficiently in an information system.

Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration
  Build wrappers/mediators on top of heterogeneous databases
  Query-driven approach: a query posed to a client site is transformed into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
Data warehouse
  Update-driven approach: information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational Databases

DW: Large amounts of data from multiple sources, which may include different DB models or files acquired from independent systems and platforms.
Traditional DB: A transactional database (relational, object-oriented).

DW: Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Optimized for retrieval.
Traditional DB: Focuses on daily operations or transaction processing. Optimized for routine transaction processing.

DW: Provides information from a historical perspective (e.g., the past 5-10 years).
Traditional DB: Current-value data.

DW: Nonvolatile.
Traditional DB: Transactions are the agent of change to the database.

DW: Supports DSS, data mining, and OLAP.
Traditional DB: Supports OLTP.

From Tables to Data Cubes

A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions:
  Dimension tables contain descriptions of the subject of the business, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  The fact table contains factual or quantitative data: measures (such as dollars_sold) and keys to each of the related dimension tables.
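The relationship between a fact table, its dimension tables, and the cells of a data cube can be sketched in a few lines. The attribute names (item_name, brand, type, day, month, quarter, year, dollars_sold) follow the text above; the keys, the sample values, and the `cube` helper are hypothetical and only meant as an illustration.

```python
from collections import defaultdict

# Dimension tables: key -> descriptive attributes (all values hypothetical).
item_dim = {
    1: {"item_name": "Tent", "brand": "Acme", "type": "Outdoor"},
    2: {"item_name": "Granola", "brand": "Acme", "type": "Food"},
}
time_dim = {
    10: {"day": 1, "month": "Jan", "quarter": "Q1", "year": 2014},
    11: {"day": 1, "month": "Apr", "quarter": "Q2", "year": 2014},
}

# Fact table: keys into each dimension table plus the measure dollars_sold.
sales_fact = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 200.0},
    {"item_key": 2, "time_key": 10, "dollars_sold": 50.0},
    {"item_key": 1, "time_key": 11, "dollars_sold": 300.0},
]


def cube(facts, item_attr, time_attr):
    """Aggregate dollars_sold by one item attribute and one time attribute."""
    cells = defaultdict(float)
    for fact in facts:
        item_value = item_dim[fact["item_key"]][item_attr]
        time_value = time_dim[fact["time_key"]][time_attr]
        cells[(item_value, time_value)] += fact["dollars_sold"]
    return dict(cells)


by_type_quarter = cube(sales_fact, "type", "quarter")
by_brand_year = cube(sales_fact, "brand", "year")
print(by_type_quarter)
print(by_brand_year)
```

Choosing different attributes from the same dimension tables gives different views of the same facts, which is the idea behind viewing the data "in multiple dimensions".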

From Tables to Data Cubes (cont.)

[Figure: Relational representation of pivot table]

[Figure: 2-D view of sales cross-tabulation (pivot table)]

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions and measures
  Star schema: a fact table in the middle connected to a set of dimension tables.
  Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
  Fact constellation: multiple fact tables share dimension tables; it can be viewed as a collection of stars.
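A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration: the table names (`fact_sales`, `dim_product`, `dim_region`), the columns, and the sample rows are all hypothetical. It builds one fact table with two dimension tables and runs a typical analytical query that joins and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the subjects of the business.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, country TEXT, region TEXT);

-- The fact table holds the measure and a key to each dimension table.
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    region_key   INTEGER REFERENCES dim_region(region_key),
    sales_amount REAL
);

INSERT INTO dim_product VALUES (1, 'Granola', 'Food Line'), (2, 'Tent', 'Outdoor Line');
INSERT INTO dim_region  VALUES (1, 'Canada', 'North America'), (2, 'Mexico', 'North America');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (2, 1, 40.0), (1, 2, 5.0), (2, 2, 20.0);
""")

# Total sales per region and product category.
rows = conn.execute("""
    SELECT r.region, p.category, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_region  AS r ON r.region_key  = f.region_key
    GROUP BY r.region, p.category
    ORDER BY p.category
""").fetchall()
print(rows)
conn.close()
```

A snowflake variant of this sketch would move the `region` column of `dim_region` into its own smaller table referenced by key, at the cost of one more join per query.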

[Figure: Example of Star Schema]

[Figure: Example of Snowflake Schema]

[Figure: Example of Fact Constellation]

Multidimensional Data

Dimensions: product, month, region
Measure: sales_amount

Data Marts

A data mart is a subset of a data warehouse, typically consisting of a single subject area.

Dependent
  Created from the warehouse
  Replicated
  Functional subset of the warehouse
Independent
  Scaled-down, less expensive version of a data warehouse
  Designed for a department or strategic business unit (SBU)
  An organization may have multiple data marts
  Difficult to integrate

OLAP

OLAP refers to a variety of activities usually performed by end users in online systems.

There is no agreement on which activities are considered OLAP. However, an OLAP tool typically includes activities such as:
  Requesting ad hoc reports and graphs
  Conducting statistical analysis
  Modeling and visualization capabilities
  Building DSS

OLAP Tools

Also known as business intelligence, business analytics, decision support, data access, and database front ends
OLAP vs. OLTP tools
Codd's 12 rules for OLAP tools:
  Multidimensional conceptual view
  Transparency
  Accessibility
  Consistent reporting performance
  Client-server architecture
  Generic dimensionality
  Dynamic sparse matrix handling
  Multi-user support
  Unrestricted cross-dimensional operations
  Intuitive data manipulation
  Flexible reporting
  Unlimited dimensions and aggregation levels

OLTP vs. OLAP

                    OLTP                               OLAP
                    (On-Line Transaction Processing)   (On-Line Analytical Processing)
User                Anyone                             Decision-makers, analysts
Function            Day-to-day operations              Decision support
DB design           Application-oriented (E-R based)   Subject-oriented (star, snowflake)
Data                Current                            Historical
View                Detailed                           Summarized
Access              Read/write                         Read mostly
# Records accessed  Tens                               Millions
# Users             Thousands                          Hundreds
DB size             100 MB to GB                       100 GB to TB

Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
Drill down (roll down): the reverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
Slice and dice: project and select.
  Slice: performs a selection on one dimension of the given cube, resulting in a sub-cube; reduces the dimensionality of the cube.
  Dice: applies a range selection on one dimension, or a selection on more than one dimension; reduces the number of member values of one or more dimensions.
Pivot (rotate): reorient the cube for visualization, e.g., turn a 3-D cube into a series of 2-D planes.
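The operations above can be sketched on a small two-dimensional cube. The sales figures are the North America numbers used in this chapter's OLAP examples; the dictionary layout, the `REGION_OF` hierarchy table, and the function names are hypothetical choices made for illustration.

```python
from collections import defaultdict

# Cube cells keyed by (country, product line).
cube = {
    ("Canada", "Food Line"): 29116.5,
    ("Canada", "Outdoor Line"): 69310.0,
    ("Mexico", "Food Line"): 12743.5,
    ("Mexico", "Outdoor Line"): 24284.0,
    ("United States", "Food Line"): 102561.5,
    ("United States", "Outdoor Line"): 232679.0,
}

# Location hierarchy: country -> region.
REGION_OF = {
    "Canada": "North America",
    "Mexico": "North America",
    "United States": "North America",
}


def roll_up(cells):
    """Roll up: climb the location hierarchy from country to region."""
    result = defaultdict(float)
    for (country, line), value in cells.items():
        result[(REGION_OF[country], line)] += value
    return dict(result)


def slice_cube(cells, line):
    """Slice: select one product line; the result has one dimension fewer."""
    return {country: value for (country, ln), value in cells.items() if ln == line}


def dice(cells, countries, lines):
    """Dice: keep only the chosen members on both dimensions."""
    return {key: value for key, value in cells.items()
            if key[0] in countries and key[1] in lines}


def pivot(cells):
    """Pivot: rotate the cube by swapping its two axes."""
    return {(line, country): value for (country, line), value in cells.items()}


print(roll_up(cube))
print(slice_cube(cube, "Food Line"))
print(dice(cube, {"Mexico", "United States"}, {"Food Line", "Outdoor Line"}))
print(pivot(cube)[("Food Line", "Canada")])
```

Drill-down is the reverse of `roll_up`: it requires going back to the more detailed cells, which is why the detailed cube is kept rather than only the summary.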

OLAP: Roll-Up (Drill-Up)

Before roll-up (country level):

                Food Line    Outdoor Line   CATEGORY_total
Canada          29,116.5     69,310         98,426.5
Mexico          12,743.5     24,284         37,027.5
United States   102,561.5    232,679        335,240.5

After roll-up (region level):

                Food Line    Outdoor Line   CATEGORY_total
North America   144,421.5    326,273        470,694.5

OLAP: Drill-Down (Roll-Down)

Before drill-down (region level):

                Food Line    Outdoor Line   CATEGORY_total
Asia            59,728       151,174        210,902

After drill-down (country level):

                Food Line    Outdoor Line   CATEGORY_total
Malaysia        618          9,418          10,036
China           33,198.5     74,165         107,363.5
India           6,918        0              6,918
Japan           13,871.5     34,965         48,836.5
Singapore       5,122        32,626         37,748

Belgium (Europe) 7,797.5     21,125         28,922.5

OLAP: Slice

Before slice:

                Food Line    Outdoor Line   CATEGORY_total
Asia            59,728       151,174        210,902
Europe          97,580.5     213,304        310,884.5
North America   144,421.5    326,273        470,694.5
REGION_total    301,730      690,751        992,481

After slice:

                Food Line    Outdoor Line   CATEGORY_total
North America   144,421.5    326,273        470,694.5

OLAP: Dice

Before dice:

                Food Line    Outdoor Line   CATEGORY_total
Canada          29,116.5     69,310         98,426.5
Mexico          12,743.5     24,284         37,027.5
United States   102,561.5    232,679        335,240.5

After dice:

                Food Line    Outdoor Line
Mexico          12,743.5     24,284
United States   102,561.5    232,679

Data Mining

A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases.

Automatic and quick data analysis.

Data mining includes tasks/activities known as:
  Knowledge extraction
  Data archaeology
  Data exploration
  Data pattern processing
  Data dredging
  Information harvesting

How Data Mining Works

Three types of methods are used to identify patterns in data:
  Simple models (SQL-based query, OLAP, human judgment)
  Intermediate models (regression, decision trees, clustering)
  Complex models (neural networks, other rule induction)

Data mining application classes:
  Classification
  Clustering
  Association
  Sequencing
  Regression
  Forecasting
  Others

Hypothesis-Driven vs. Discovery-Driven Data Mining

Hypothesis-driven data mining begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition. For example, a marketing manager may begin with the proposition, "Are DVD player sales related to sales of television sets?"

Discovery-driven data mining finds patterns, associations, and relationships among the data. It can uncover facts that were previously unknown.
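The difference between the two approaches can be sketched with simple association measures. In the sketch below, the hypothesis-driven step tests one rule chosen by the user (DVD player implies television set), while the discovery-driven step scans all item pairs. The baskets are made-up data, and the function names and the 0.4 support threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical market baskets; each set is one customer transaction.
baskets = [
    {"tv", "dvd_player"},
    {"tv", "dvd_player", "cables"},
    {"tv"},
    {"dvd_player"},
    {"cables"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


def frequent_pairs(transactions, min_support):
    """Discovery-driven step: every item pair whose support meets the threshold."""
    items = sorted(set().union(*transactions))
    found = {}
    for pair in combinations(items, 2):
        s = support(set(pair), transactions)
        if s >= min_support:
            found[pair] = s
    return found


# Hypothesis-driven: test the user's proposition about DVD players and TVs.
print(support({"dvd_player", "tv"}, baskets))
print(confidence({"dvd_player"}, {"tv"}, baskets))

# Discovery-driven: let the data suggest which pairs occur together.
print(frequent_pairs(baskets, min_support=0.4))
```

With real data the pair scan grows quickly with the number of items, which is one reason dedicated association-mining algorithms exist; this sketch only shows the idea.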

Tools and Techniques

Data mining tools and techniques:
  Statistical methods (association, regression, and cluster)
  Decision trees (classification, clustering)
  Case-based reasoning (pattern detection)
  Neural computing (pattern detection)
  Intelligent agents
  Genetic algorithms

Text Mining

Text mining is the application of data mining to nonstructured or less structured text files.

It helps the organization to:
  Find the "hidden" content of documents, including additional useful relationships.
  Relate documents across previously unnoticed divisions; for example, discover that customers in two different product divisions have the same characteristics.
  Group documents by common themes; for example, all the customers of an insurance firm who have similar complaints and cancel their policies.
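Grouping documents by common themes can be approximated, in a very simplified way, by comparing the sets of words that documents share. The sketch below uses Jaccard similarity with a greedy grouping rule; this is a stand-in technique chosen for brevity, not a method prescribed by the text, and the complaint texts, stopword list, and 0.5 threshold are hypothetical.

```python
import re

# Hypothetical customer complaints from an insurance firm.
documents = {
    "c1": "Claim payment was late and the premium is too high",
    "c2": "Premium too high, late claim payment again",
    "c3": "Agent was rude on the phone",
}
STOPWORDS = {"was", "and", "the", "is", "on", "again"}


def terms(text):
    """Lower-case the text and return its set of non-stopword terms."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def jaccard(a, b):
    """Jaccard similarity: shared terms divided by all terms."""
    return len(a & b) / len(a | b)


def group_by_theme(docs, threshold=0.5):
    """Greedily place each document in the first group it resembles."""
    groups = []  # list of (representative term set, member ids)
    for doc_id, text in docs.items():
        doc_terms = terms(text)
        for rep_terms, members in groups:
            if jaccard(doc_terms, rep_terms) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((doc_terms, [doc_id]))
    return [members for _, members in groups]


print(group_by_theme(documents))
```

Real text mining tools use richer representations (stemming, weighting, clustering algorithms), but the underlying idea of measuring how much content documents share is the same.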

Multidimensionality

Multidimensionality is an efficient way to organize data in different ways for analysis and presentation.

Its major advantage is that the data are organized according to managers' needs, not analysts'.

Three factors are considered in multidimensionality: dimensions, measures, and time. Here are some examples:
  Dimensions: products, salespeople, market segments, business units, geographic locations, distribution channels, countries, industries
  Measures: money, sales volume, head count, inventory profit, actual vs. forecasted
  Time: daily, weekly, monthly, quarterly, yearly

Data Visualization

Technologies supporting visualization and interpretation: digital imaging, GIS, GUI, tables, multidimensions, graphs, VR, 3D, and animation.
Helps identify relationships and trends.
Data manipulation allows a real-time look at performance data.

Multidimensionality

Multidimensionality has some limitations:
  The multidimensional database can take up significantly more computer storage.
  Multidimensional products cost significantly more.
  Database loading consumes system resources and time, depending on data volume and number of dimensions.
  Interfaces and maintenance are more complex than in relational databases.

Geographic Information System (GIS)

A computerized system for managing and manipulating data with digitized maps
  Geographically oriented
  Geographic spreadsheet for models
  Software allows web access to maps
  Used for modeling and simulations
