Intro to Data Mining

8/8/2019 Intro to Data Mining


Introduction to Software Engineering

NOTES

Self-Instructional Material 3





UNIT 1 INTRODUCTION TO DATA MINING & WAREHOUSING (FULL TEXT)

Structure

1.0 Introduction

During the early nineties, industries realised that they were not getting
the promised returns on their investment in IT infrastructure. A major
focus of industrial leaders was to utilise IT as a strategic tool to
maximise profits. They expected IT to strengthen their decision-making
capability, not merely to produce MIS reports, which were primarily
routine in nature and did not help them sift through voluminous data to
identify hidden, camouflaged or implied information. The emphasis
therefore shifted from generating MIS reports to what was termed KDD
(Knowledge Discovery in Databases). KDD involves a large number of
components, of which the most desirable was the extraction of useful
information or patterns from massive corporate data; this was termed
“Data Mining”. The following text is a systematic presentation of Data
Mining techniques and of the Data Warehouse framework, which is the
repository of the vast, integrated, time-variant, historical and
subject-oriented data to be operated upon.

1.1 Unit Objectives

• To explain the evolution of Data Mining & Warehousing.

• To understand the components of a Data Mining & Warehousing system.

• To study the complementary relationship of Data Mining & Warehousing.

• To learn the steps involved in implementing a Data Mining &
Warehousing architecture.

• To identify the problems associated with the Data Mining & Data
Warehousing framework.

• To know the role of Data Mining & Warehousing in strategic decision
making and in giving a competitive advantage to a business activity.

1.2 Emergence of Data Mining & Warehousing

1.2.1 Business needs drive technology

Data Mining & Warehousing have now become familiar words not only for
computer professionals but for most decision makers. Rapid growth has taken
place in the technology surrounding them, with most of the leading companies
of the world creating products and services to exploit its potential. This
brings out an underlying fact: any technology results from business needs.
In this case, it was an inescapable and urgent requirement of the business
community to have a powerful, online decision-making tool to support and
substantiate its own intuitive thought processes. This has resulted in the
development of Data Mining & Warehousing technology.

1.2.2 IT solutions for Strategic Decision making

Information Technology has emerged as a powerful business driver and an
essential component for companies, giving them a competitive advantage in a
challenging market. Each and every aspect of IT is being closely examined
and integrated into business activity. But the very existence of an industry
depends on the strategic decisions taken by its top echelon, because these
are the ones that its middle-level and operational staff translate into
actions. Data Mining & Warehousing IT solutions have taken strategic
decision making beyond the boundaries of the conventional MIS (Management
Information System) and OIS (Operational Information System).

1.2.3 Focus on the End Users

Data Mining & Warehousing, as we have seen, has been developed for managers
and top-level executives, to assist them in reaching decisions based not
only on facts and figures seen superficially, but on inferences drawn from
hidden, widely dispersed and uncorrelated data. The focus of the system is
on the end users and on meeting their requirements by offering simple
interfaces. It is essential to keep the front-end tools simple and user
friendly. The experts have to create the system keeping in view the
technical limitations of the end users and their lack of understanding of
system capabilities in the initial stages, and must support them in
achieving the desired result.

1.3 Evolution of Data mining & Warehousing

Data Mining & Warehousing has grown to its present potential in a number
of clearly demarcated stages. Prior to the 1970s we had flat files and
databases. The emphasis then was on data collection and database creation.
A manager was able to use them only for simplifying day-to-day tasks.

These were further developed into DBMS from the 1970s to the early 1980s,
offering advantages like concurrent, shared or distributed access, ensuring
consistency and providing information security. The DBMS gave us the
backbone for reports and supported the conventional MIS and OIS features.

[Figure: Information Systems, divided into Operational Information Systems
(ERP, CRM, SCM etc.) and Management Information Systems (DSS, EIS, Expert
Systems etc.)]

Later on, the DBMS followed three growth paths. These were:

(a) Advanced Database Systems – Developed from the 1980s till the present.
They included Object-relational, Spatial, Temporal and Biological database
systems.

(b) Data Warehousing & Data Mining – Evolved from the 1980s till the
present as Knowledge Discovery & Data Mining, Data Warehouse & OLAP
technologies. Their full potential was not realised at first owing to the
limitations of the hardware, networking and software of the time.

(c) Web Based Database Systems – Developed from the 1990s till the present,
based on the phenomenal reach of the Internet, XML-based systems, Web
technology and Web mining.

The latest trend of the decade is integrated information systems based on
the above three, paving the way for revolutionary developments in decision
making in the corporate world.

[Figure: Evolution of database technology. 1960-1970: Flat Files, Database;
1970-1980: DBMS; 1980 till today: Advanced Databases, Data Mining, Web
Database; 2000 onwards: Integrated Database]




1.4 Introduction to Data Mining

1.4.1 What is Data mining

Data Mining means locating, identifying and finding unforeseen information
in a large database. The information should be interesting to the end user.
It can also be understood as data analysis based on searching, or as
learning based on deduction.

1.4.2 What is Interestingness of Pattern?

A data pattern discovered through a database search is considered
interesting if it is easily understood, valid on new or test data with some
degree of certainty, potentially useful and novel. Interesting patterns are
identified by objective measures, which are combined with subjective
requirements to reflect the needs and interests of a particular user.

1.4.3 Data Mining & Knowledge Discovery

Difference between Data Mining & Knowledge Discovery in Databases

How are they different? Data Mining is devoted specifically to the processes
involved in the extraction of useful information by applying specific
techniques drawn from certain knowledge domains, such as statistics and
Artificial Intelligence. Knowledge Discovery is a wider term and covers the
entire range of activities: deciding business objectives, capturing the
desired data, preparing, processing and arranging it, applying predefined
techniques, and then presenting the results in an understandable form to the
user. Specifically, Knowledge Discovery can be sub-divided into five steps,
which are performed repetitively till the desired result is reached. One of
them is Data Mining.

1. Data Processing, comprising Data Selection, Data Cleaning and Data
Integration.

2. Data Transformation and organising the data in a form ready for fast
access.

3. Data Mining (DM Engine) and other techniques like OLAP/OLTP for
searching and extraction.

4. Knowledge presentation methods through a Graphical User Interface (GUI).

5. Analysing the result and assimilating it in a knowledge domain.
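
The first four Knowledge Discovery steps listed above can be pictured as a chain of small functions, each feeding the next. The following is only a minimal sketch under stated assumptions: the records, the function names and the "total amount per city" pattern are all hypothetical, and the mining step is a simple stand-in rather than a real data mining algorithm. The fifth step, analysing the result, is left to the reader of the output.

```python
from collections import Counter

# Hypothetical raw records from an operational source; None marks a missing value.
RAW = [
    {"customer": "a", "city": "Pune ", "amount": "120"},
    {"customer": "b", "city": "pune", "amount": None},
    {"customer": "c", "city": "Delhi", "amount": "80"},
    {"customer": "d", "city": "Pune", "amount": "200"},
]

def process(records):
    """Step 1, data processing: select and clean (drop records with no amount)."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """Step 2, data transformation: normalise names and convert types."""
    return [
        {"city": r["city"].strip().title(), "amount": int(r["amount"])}
        for r in records
    ]

def mine(records):
    """Step 3, data mining: a stand-in pattern, the total amount per city."""
    totals = Counter()
    for r in records:
        totals[r["city"]] += r["amount"]
    return dict(totals)

def present(pattern):
    """Step 4, knowledge presentation: text lines standing in for a GUI."""
    return [f"{city}: {total}" for city, total in sorted(pattern.items())]

def kdd(records):
    """Run steps 1 to 4 in order."""
    return present(mine(transform(process(records))))

if __name__ == "__main__":
    for line in kdd(RAW):
        print(line)
```

In practice the chain is run repeatedly, adjusting the cleaning rules or the mining technique until the presented result is useful.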

The following diagram refers:

[Figure: Knowledge Discovery process flow: Data Processing, Data
Transformation, Data Mining Engine, Knowledge Presentation through GUI,
Result Analysis]

We can thus consider Data Mining as a subset of Knowledge Discovery.

1.4.4 Nature of Data to be mined – Operational & Analytical

Data Mining is an essential step towards the creation of information
systems. These are Operational Information Systems (OIS), like Enterprise
Resource Planning (ERP), or Management Information Systems, including
Decision Support Systems (DSS). DSS assist managers in taking decisions
based on the available unstructured data and validate their intuitive
judgements. OIS and DSS each have their own requirements for data structures
and databases.

Data in turn is categorised as operational or analytical. Operational data
is dynamic in nature and meets short-term goals. Analytical data has a
longer time span and supports intuitive decisions. An operational database
supports transaction processing through On Line Transaction Processing
(OLTP) queries. An analytical database meets the On Line Analytical
Processing (OLAP) requirements of Decision Support Systems (DSS).

Differences in database requirements for OLTP & DSS

Characteristic          DB for OLTP                  DB for OLAP needs
1. Nature of content    Dynamic                      Static
2. Time span            Current                      Historical
3. Time measured        Implicit, implied            Explicit & mentioned
4. Level of detail      Primitive/Detailed           Detailed & Derived/Granularity
5. Update cycle         Real time                    Periodic, planned
6. Tasks                Known pattern, repetitive    Unpredictable
7. Response             Time bound                   Flexible

1.4.5 OLTP & OLAP

Having seen the database requirements of OIS & DSS, let us differentiate
the query systems associated with each. These are OLTP & OLAP. OLTP fulfils
the requirements of OIS well, as the queries are simple in nature. OLAP, on
the other hand, addresses the need to define more complex queries and
requires novel databases, in the form of Multi Dimensional & Multi
Relational Databases (MDDB & MRDB respectively), to provide the back end.

Features of both OLAP & OLTP are compared below:

Feature                          OLTP                               OLAP
1. Meant for                     OIS                                MIS/DSS
2. Purpose                       Supports transactions              For analysis
3. End user                      Operations level, DB specialists   Knowledge worker
4. Function                      Daily operations                   Long-term needs
5. DB design                     ER based, application oriented     Star/snowflake schemas, subject oriented
6. Data                          Current, up-to-date                Historical
7. Summarization                 Primitive, highly detailed         Aggregated
8. View                          Relational                         Multidimensional, multi-relational
9. Work unit                     Short, simple transaction          Complex query
10. Access mode                  Both read/write                    Mostly read
11. Based on                     Data inputs                        Derived information
12. Operations                   Operation on primary key           Multiple scans
13. Number of records accessed   Few                                Many
14. Number of users              Large number                       Selected
15. DB size                      In MB/GB                           From over 100 GB to TB
16. Priority                     High performance & availability    High flexibility, end user autonomy
17. Measure                      Transaction throughput             Specific query
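
The contrast in work unit, access mode and operations in the table above can be illustrated with two queries against one small table. This is a hedged sketch only: the `orders` table, its columns and its rows are hypothetical, and a single in-memory SQLite database stands in for what would normally be separate operational and analytical stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "North", 2018, 100.0),
        (2, "North", 2019, 150.0),
        (3, "South", 2018, 80.0),
        (4, "South", 2019, 120.0),
    ],
)

# OLTP-style work unit: a short read/write transaction located by primary key.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (110.0, 1))
oltp_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (1,)
).fetchone()

# OLAP-style work unit: a read-only query that scans many rows and aggregates.
olap_rows = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

if __name__ == "__main__":
    print(oltp_row)
    for row in olap_rows:
        print(row)
```

The first query touches one record and changes it; the second touches every record and changes nothing, which is why the two workloads are usually served by differently designed databases.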

Comparison of Database & DWH

DB (Transaction Processing)                         DWH (DSS)
1. Systematic data stored in a prescribed format    Collection of unstructured data
2. Contains operational data                        Contains non-operational data
3. Uses a structured language for searching         Uses DM tools for extracting patterns
4. Keeps normalized data                            Does not store normalized data

Dr E.F. Codd’s guidelines for OLAP

OLAP is an essential ingredient of Data Mining. It is therefore essential to
understand the relevance of the guidelines of Dr E.F. Codd (a well-known
authority on RDBMS). An interpretation of each of them is given below,
relating them to the issues involved:

1. Multidimensional conceptual view - Business problems are complex and can
be solved only through a multidimensional concept, as normal queries cannot
address them effectively. Multidimensional schemas are therefore essential
to create relevant databases.

2. Transparency - An end user must be presented with a cohesive, unambiguous
version of the data and must not be exposed to the complexity and diversity
of data sources.

3. Accessibility - Essential data must be identified and accessed.

4. Consistent reporting performance - Reporting must remain dependable and
reliable even as the database size increases.

5. Client/server architecture - DM and Warehousing systems are created to
meet growing business needs and financial constraints.

6. Generic dimensionality - Every data dimension has the same importance.

7. Dynamic sparse matrix handling - The system must keep the database size
within limits by adopting suitable methods of handling sparse matrices.

8. Multiuser support - The system must permit a large number of users to
access it at the same time.

9. Unrestricted cross-dimensional operations - Multidimensional schemas must
be well understood and designed, and must permit cross references.

10. Intuitive data manipulation - OLAP is created for decision makers to
make intuitive decisions. They are not computer experts and must be provided
with user-friendly, uncomplicated access for generating queries.

11. Flexible reporting - The system must be capable of providing the reports
desired by the end user.

12. Unlimited dimensions and aggregation levels - The system must remain
flexible and expandable, allowing extra dimensions to be added and
additional aggregations to be performed.

1.4.6 Data Mining a multidisciplinary area

Data mining is a confluence or combination of multiple disciplines. Some of

these are:

1. Information science

2. Database technology

3. Statistics

4. Machine Learning

5. Visualization

6. Other Disciplines


[Figure: Data Mining at the centre, drawing on Information Science,
Statistics, Database Technology, Machine Learning, Visualisation and Other
Sciences]

Successful development of a Data Mining system thus requires joint efforts
from experts in different domains.

1.4.7 Classification of Data mining systems

Data mining involves the development of special algorithms to answer the
queries of various users. The procedure is to evolve a number of models and
to match one of them to the data stored in the database. The three steps
involved in this process are: creating a model, deciding the criteria for
preferring one model over the others, and identifying the search technique.

Data Mining models, being mathematical in nature, are classified as
Predictive and Descriptive.

(a) A Predictive model predicts the values that data may assume, based on
known results from other data stored in the database. A predictive model
performs the data mining tasks of Classification, Regression, Time Series
Analysis and Prediction.

(b) A Descriptive model is based on the identification of patterns and
relationships in data. It aims to discover, rather than predict, the
properties of data. A descriptive model performs the data mining tasks of
Clustering, Summarisation, Association Rules and Sequence Discovery.

[Figure: Data Mining Models, divided into Predictive Model and Descriptive
Model]

1.4.8 Data Mining Tasks

The basic tasks under Predictive and Descriptive models are:

Predictive Model

(a) Classification - Data is mapped into predefined groups or classes. It is
also termed supervised learning, as the classes are established prior to the
examination of data.

(b) Regression - A data item is mapped to a known type of function. This may
be linear, logistic, etc.

(c) Time Series Analysis - The value of an attribute is examined as it
varies with time, usually at evenly spaced intervals.

(d) Prediction - Foretelling future data states based on past and current
data.

Descriptive Model

(a) Clustering - Also referred to as unsupervised learning, segmentation or
partitioning. In clustering the groups are not predefined.

(b) Summarisation - Data is mapped into subsets with simple descriptions. It
is also termed characterisation or generalisation.

(c) Sequence Discovery - Sequential analysis or sequence discovery is used
to find sequential patterns in data. It is similar to association, but the
relationship is based on time.

(d) Association Rules - A model which identifies specific types of data
associations.
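
The difference between classification (classes fixed in advance) and clustering (groups discovered from the data) can be shown on a single list of numbers. The sketch below is illustrative only: the marks, the labels and both function names are hypothetical, the classifier is a simple nearest-class-mean rule, and the clustering is a bare two-group k-means on one attribute, chosen here for brevity rather than taken from the text.

```python
def mean(values):
    return sum(values) / len(values)

def classify(labelled, x):
    """Classification: the classes come from labelled examples.

    `labelled` is a list of (value, label) pairs. The new value x is
    assigned to the class whose mean value is nearest.
    """
    by_class = {}
    for value, label in labelled:
        by_class.setdefault(label, []).append(value)
    centroids = {label: mean(vals) for label, vals in by_class.items()}
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cluster_two(values, rounds=10):
    """Clustering: no labels are given; two groups are discovered.

    Starts from the smallest and largest value as group centres and
    repeatedly reassigns each value to its nearest centre.
    """
    centroids = [min(values), max(values)]
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            groups[nearest].append(v)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return groups

if __name__ == "__main__":
    labelled = [(35, "fail"), (40, "fail"), (78, "pass"), (90, "pass")]
    print(classify(labelled, 70))
    print(cluster_two([35, 40, 42, 78, 82, 90]))
```

The classifier can only answer with a label it has already seen, whereas the clustering routine returns groups that have no names until a user interprets them.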

1.4.9 Data Mining Primitives

A data mining task is expressed in the form of a DMQL statement and requires
certain primitives to be stated. These are:

(a) Task-relevant data - Specifies the part of the database to be examined.

(b) Nature of knowledge to be mined - Defines the tasks or functions to be
performed on the data. Examples are Characterisation, Association and
Clustering.

(c) Background knowledge - Here this means the concept hierarchies, as they
indicate the level of abstraction at which data is to be mined.

(d) Interestingness measures - These are defined for the task or function to
be performed. For example, for an Association rule the Support and
Confidence factors are measured against threshold levels specified by the
user as a measure of interestingness.

(e) Presentation & visualisation of discovered patterns - The ways in which
the result obtained can be displayed for the convenience of the user.
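
The Support and Confidence factors mentioned under interestingness measures can be computed directly from a list of transactions. The following sketch assumes the usual definitions (support is the fraction of transactions containing all the items; confidence of a rule A => C is support of A and C together divided by support of A). The basket data, function names and threshold values are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def is_interesting(transactions, antecedent, consequent,
                   min_support, min_confidence):
    """Apply the user's thresholds to one candidate rule."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)

# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"bread", "milk"}))
    print(confidence(BASKETS, {"bread"}, {"milk"}))
    print(is_interesting(BASKETS, {"bread"}, {"milk"}, 0.4, 0.6))
```

Raising either threshold reduces the number of rules reported, which is how the user controls what counts as interesting.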

1.4.10 Data Mining Query Language (DMQL)

Data mining systems are required to support ad hoc and interactive knowledge
discovery from relational databases at multiple levels of abstraction. Data
mining languages are designed to meet this requirement. They help us
formulate a query that defines the primitives of a data mining task.

The primitives required are:

(a) The set of task-relevant data to be mined.

(b) Nature of knowledge to be mined.

(c) Background knowledge required for the discovery.

(d) Measures of interestingness.

(e) Visualisation and representation of the discovered patterns.

DMQL follows an SQL-like syntax, which is amenable to linking with
relational query languages and simplifies the user's task of knowledge
extraction.

1.4.11 Integration of Data mining systems

A diverse number of data mining tools may be available in an organisation.
It is essential to identify and categorise them to understand which model
and tasks they support. It should also be found out whether any data mining
tools are being developed in-house. The tools are then displayed on the GUI
of the client desktop, so that the right data mining tool can be selected
for the problem in hand.

1.4.12 Major issues of Data mining

These are mentioned below:

(a) Human Interaction.

(b) Over-fitting

(c) Outliers.

(d) Interpretation of Results

(e) Visualisation of Results.

(f) Large Datasets

(g) High Dimensionality.

(h) Multimedia data.

(i) Missing Data.

(j) Irrelevant Data.

(k) Noisy Data.

(l) Changing data.

(m) Integration.

(n) Application

1.5 Introduction to Data Warehousing (DWH)

1.5.1 What is Data Warehouse

W.H. Inmon, well known as the father of the data warehouse concept, defines
it thus: “A Data warehouse is a subject oriented, integrated, non-volatile
and time-variant collection of data in support of management’s decisions”.

Subject oriented means that the data in a data warehouse is organised
subject-wise, even at the expense of redundancy. Thus every manager has
access to the desired information in the shortest possible time,
notwithstanding the extra space occupied by it.

Integrated implies that related database tables, created in the form of Fact
& Dimension tables, can be linked to each other and are not stored as
stand-alone data resources.

Non-volatile means that data is stored on a permanent basis. It is purged or
removed only as an exception, to meet an organisational need.

Time variant requires all data entered in the data warehouse to be time
stamped, that is, associated with its time of entry. The time element
introduced may not be the actual time when the data entered the operational
system.

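Data Warehouse (DWH) Block Diagram

[Figure: Data warehouse block diagram. Components: External Data,
Transformation Tools, Operational Data Store, Metadata, Data Warehouse DBMS
& Data Repository, MRDB, MDDB, Data Marts, Admin & Mgt. Tools, OLAP Tools,
Report/Query Tools, Data Mining Tools, Information Delivery System]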

1.5.2 Data Warehouse Building Blocks

Data Pre-processing Tools

Pre-processing means the sourcing, acquisition, cleaning and transformation
of data prior to its entry into the data warehouse repository. The data is
received from legacy systems, the Web or other external sources. The data,
and the databases from which it is received, are themselves heterogeneous,
so pre-processing requires:

(a) Removal of unwanted data

(b) Conversion to common data names & definitions

(c) Summarising the data

(d) Completing missing data

Operational Data Store

The data is transformed and loaded into the Operational Data Store (ODS) in
real time. From the ODS it is loaded into the data warehouse after
extraction and cleaning operations at regular intervals, not as and when it
is received from external sources. As such, a time of entry is attached to
it. The data thus available is loaded under the control of the metadata.

Metadata

Metadata is data about data. It holds information as:

(a) Technical metadata - Contains the sources of data, data structures,
transformation descriptions, rules specified during data processing, access
authorisations & backup history.

(b) Business metadata - Contains information about subject areas,
information object types, Internet home pages, information delivery system
details (that is, when to despatch information and to whom), data warehouse
operational information & ownership details.

Data Warehouse Database

It is the central database, consisting of the Data Warehouse RDBMS, a large
repository and supporting databases like the Multi Relational Database,
Multi Dimensional Database & Data marts.

Data Mart

A data mart is another important component of the data warehouse and is a
data store that is subsidiary to a data warehouse. It is created to meet the
specific information needs of different functional area managers. Data marts
are a part of the data warehouse database and cannot be taken as an
alternative to a data warehouse.

Management & Administration Tools

They are provided for:

1. Managing & updating of metadata.

2. Backup & recovery.

3. Removal of unwanted data.

4. Security & assigning priorities.

5. Quality checks.

6. Distribution of data.

Access Tools

They are categorised as:

1. Query & reporting tools.

2. Application tools - To meet specific user requirements.

3. Data mining tools - To discover knowledge, visualise data and correct
data when the input data is incomplete.

4. OLAP tools - These are associated with multidimensional databases to
provide elaborate, complex views for analysis.

Information Delivery System

It provides an external interface to deliver data warehouse reports and
information objects to external users as per a specified schedule.

1.5.3 Granularity of Data

Granularity means the level of detail or summarisation at which data is
stored in a data warehouse. The higher the level of granularity, the less
detail is available for a data item, and vice versa. A data warehouse
manager is required to identify the granularity of data for the organisation
so that reports of the requisite detail are available.

An example is the maintenance, by telecom operators, of the details of each
and every call made by a mobile user, providing a high level of detail (a
low level of granularity) to meet legal requirements at a later stage.

Granular data offers the advantage of reusability of data by other users and
also helps in optimising the storage space.
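
The telecom example above can be made concrete by rolling individual call records up to one summary row per subscriber per day. This is a minimal sketch with hypothetical subscriber numbers, dates and durations; it shows only how raising the level of granularity trades detail for fewer rows.

```python
from collections import defaultdict

# Hypothetical call detail records: (subscriber, date, duration in seconds).
# This is the low level of granularity: one row per call.
CALLS = [
    ("9001", "2019-08-01", 60),
    ("9001", "2019-08-01", 120),
    ("9001", "2019-08-02", 30),
    ("9002", "2019-08-01", 300),
]

def daily_summary(calls):
    """Higher level of granularity: one row per subscriber per day.

    Returns {(subscriber, date): (call_count, total_seconds)}.
    """
    summary = defaultdict(lambda: [0, 0])
    for subscriber, day, seconds in calls:
        entry = summary[(subscriber, day)]
        entry[0] += 1
        entry[1] += seconds
    return {key: tuple(value) for key, value in summary.items()}

if __name__ == "__main__":
    for key, value in sorted(daily_summary(CALLS).items()):
        print(key, value)
```

The summary answers "how long did this subscriber talk on this day", but the length of any single call can no longer be recovered from it, which is why the detailed records must be kept when they are legally required.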

1.5.4 Multidimensional Data Models & Schemas

Data warehouses and OLAP tools are based on what is known as a
multidimensional model. In such a model data is visualised as a data cube,
defined by Fact & Dimension tables.

Facts are the numerical measures of a central theme. For example, for the
theme Student, the measures may be Marks_obtained and Division_Scored.

Dimensions are the entities with respect to which the organisation keeps its
records. For example Teacher, Subject, Class, College, University etc.

Concept Hierarchies

A concept hierarchy is a method of defining a sequence of levels for each
entity. An example is City, District, State and Country.
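
A concept hierarchy such as City, District, State and Country lets a measure recorded at the lowest level be rolled up to any higher level. The sketch below assumes a hypothetical lookup table and hypothetical sales figures; the names `HIERARCHY`, `SALES_BY_CITY` and `roll_up` are illustrative and do not come from the text.

```python
# Hypothetical concept hierarchy: each city mapped to its higher levels.
HIERARCHY = {
    "Pune":   {"district": "Pune District",   "state": "Maharashtra", "country": "India"},
    "Nagpur": {"district": "Nagpur District", "state": "Maharashtra", "country": "India"},
    "Mysuru": {"district": "Mysuru District", "state": "Karnataka",   "country": "India"},
}

# Hypothetical measure recorded at the city level.
SALES_BY_CITY = {"Pune": 10, "Nagpur": 5, "Mysuru": 7}

def roll_up(measures, level):
    """Aggregate a city-level measure to the requested hierarchy level."""
    totals = {}
    for city, value in measures.items():
        key = city if level == "city" else HIERARCHY[city][level]
        totals[key] = totals.get(key, 0) + value
    return totals

if __name__ == "__main__":
    for level in ("city", "district", "state", "country"):
        print(level, roll_up(SALES_BY_CITY, level))
```

Each step up the hierarchy gives a more abstract view of the same data, which is the sense in which background knowledge fixes the level of abstraction at which mining is done.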

Schemas

While the Entity-Relationship model was found adequate for the design of
relational databases, a data warehouse requires a subject-oriented schema
for better analysis and for handling more complex queries.

Three schemas are therefore used to meet the data warehouse requirements.
These are:

Star schema

This is the most common model. A large central Fact table holds the bulk of
the data without duplication or redundancy. It refers to a number of
Dimension tables, each of which handles one dimension.

[Figure: Star schema. A central Fact table containing keys A, B, C, D &
measures, linked to four Dimension tables with keys A, B, C and D]

Snowflake schema

It is an extension of the Star schema. The dimension tables are further
extended into extra tables.

[Figure: Snowflake schema. A central Fact table containing keys A, B, C, D &
measures, linked to Dimension tables with keys A, B, C and D; the Dimension
tables for C and D also carry keys E and F, which link to further Dimension
tables with keys E and F]

Constellation schema

It has multiple fact tables to meet the requirements of more advanced
applications. The fact tables are permitted to share Dimension tables. The
example given below refers:

1.5.5 Data Warehouse design

A Data Warehouse design consists of:

1. Choosing a business process to model, for example Orders, Invoices,
Shipments etc.

2. Choosing between a DWH, for a large organisation, and a Data mart, for a
departmental implementation.

3. Choosing the grain of the business process - the fundamental, atomic
level of data to be represented in the Fact table.

4. Choosing the Dimensions to be applied to each Fact table.

5. Choosing the measures that will populate each Fact table record, e.g.
Units_sold, Rs_sold.
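
The design choices above (grain, dimensions and measures) can be sketched as a very small star schema. Everything in the example is an assumption made for illustration: the table names `fact_sales`, `dim_product` and `dim_store`, the chosen grain of one row per product per store per day, and the sample rows. SQLite is used only because it runs in memory without any setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each with its own key.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table. Grain: one row per product per store per day.
-- Measures: units_sold and rs_sold.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    sale_date   TEXT,
    units_sold  INTEGER,
    rs_sold     REAL
);

INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Book');
INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES
    (1, 1, '2019-08-01', 10, 100.0),
    (2, 1, '2019-08-01',  2, 400.0),
    (1, 2, '2019-08-01',  5,  50.0);
""")

# A typical analytical query: join the fact table to one dimension
# and aggregate the measures along it.
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.units_sold), SUM(f.rs_sold)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()

if __name__ == "__main__":
    for row in sales_by_city:
        print(row)
```

Splitting `dim_store` further, for example into a separate city table, would turn this star into a snowflake; adding a second fact table that reuses `dim_product` would make it a constellation.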


[Figure: Constellation schema. Fact table 1 (keys A, C) and Fact table 2
(keys B, D) sharing a Dimension table (keys A, B, C), with further Dimension
tables for keys C (also carrying key E), D and E]

Based on these principles, a nine-step method is evolved as under:

1. Choosing the subject matter.

2. Deciding what the Fact table represents.

3. Identifying and conforming the dimensions.

4. Choosing the facts.

5. Storing pre-calculations in the Fact table.

6. Rounding out the dimension tables.

7. Choosing the duration of the database.

8. Tracking slowly changing dimensions.

9. Deciding the query priorities.

1.5.6 Data Warehouse Architecture

Data Warehouse architecture is based on an RDBMS server. It has a massive
central repository for the storage of data, subsidiary databases and
front-end tools.

The architecture consists of:

1. Bottom Tier - An RDBMS & a DWH server

2. Middle Tier - OLAP server

3. Top Tier - Front-end tools

[Figure: Three-tier architecture, from left to right: Data Servers,
Application Servers, Clients]

Virtual Warehouse

Another commonly used term is the virtual warehouse. It is a set of views
over operational databases. For efficient query processing, only some of the
possible summary views are materialised. It is easy to build but requires
excess capacity on the operational database servers.

Developing a Data Warehouse

It consists of:

1. Defining a high-level corporate data model.

2. Developing an Enterprise Data Warehouse and continuing to refine it to
meet user requirements.

3. In parallel, developing data marts and refining these models.

1.5.7 ROLAP,MOLAP & HOLAP

These tools utilise specialised data structures to organise, navigate and
analyse data, typically in an aggregated form. They require a tight coupling
with the application and the presentation layer.


[Figure: Contents of the three tiers. Data servers: Data Warehouse Database,
Metadata, Data Logic. Application servers: MDDB, MRDB, Metadata. Clients:
GUI, Presentation Logic, Query Specification]

3. HOLAP uses the best features of both, i.e. the flexibility of the ROLAP
RDBMS and the optimised multidimensional structure of MOLAP. Users are given
a limited analysis capability, either directly against RDBMS products or
through an intermediate MOLAP server. A user can send a query to select data
from the DBMS, which then delivers the requested data to the desktop, where
it is placed in a data cube. The desired information is maintained locally
and need not be created each time a query is given.

4. Salient differences to be noted are:

(a) In MOLAP no query is given directly by the user to the DWH server. The
desired multidimensional data is positioned in the MOLAP server after the
MOLAP server sends an SQL query to the DWH server.

(b) A ROLAP server does not store the intermediate result in a cube but in a
relational table. The user gets his/her query serviced by the ROLAP server.

(c) In HOLAP, SQL is sent by the user to the DWH server. The result is then
either received by the user directly, or an intermediate MOLAP server data
cube is created and accessed by the user.
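
The contrast between answering a query from relational rows and answering it from a prepared cube can be illustrated in a few lines. This is a simplified sketch, not a description of how any real OLAP server is built: the rows, the two dimensions (region and year) and the function names are hypothetical, and the "cube" is just a dictionary of precomputed totals.

```python
from collections import defaultdict

# Hypothetical relational rows: (region, year, amount).
ROWS = [
    ("North", 2018, 100),
    ("North", 2019, 150),
    ("South", 2018, 80),
    ("South", 2019, 120),
]

def rolap_total(rows, region=None, year=None):
    """ROLAP style: scan the relational rows afresh for every query.

    A value of None for a dimension means "all values of that dimension".
    """
    return sum(
        amount
        for r, y, amount in rows
        if region in (None, r) and year in (None, y)
    )

def build_cube(rows):
    """MOLAP style: precompute every cell once, including 'ALL' roll-ups."""
    cube = defaultdict(int)
    for r, y, amount in rows:
        for key in ((r, y), (r, "ALL"), ("ALL", y), ("ALL", "ALL")):
            cube[key] += amount
    return dict(cube)

if __name__ == "__main__":
    cube = build_cube(ROWS)
    # The same question answered both ways.
    print(rolap_total(ROWS, region="North"), cube[("North", "ALL")])
    print(rolap_total(ROWS, year=2019), cube[("ALL", 2019)])
```

The relational approach does work at query time and needs no extra storage; the cube does its work once in advance and then answers by lookup. A hybrid arrangement keeps detailed data relational and builds a cube only for the parts that are queried often.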

1.6 Developing a Data Mining & Warehousing framework

1.6.1 Evolving System.

Data Mining & Warehousing systems are not only massive but extremely
critical for the entire organisation. They are not built in one stroke but
evolved gradually, as they require a substantial investment from the
organisation in the form of human and financial resources. It is therefore
common practice to identify one subject area, build a system for it and then
gradually move to other functional areas. Several measures are introduced to
curtail the investment. One of them is to implement only a Data mart to
begin with and then create the data warehouse repository. It is also
essential to select systems which are upward scalable, keeping the evolution
factor in mind.

1.6.2 SDLC & CLDS

Systems Development Life Cycle (SDLC) is a requirements-driven life cycle
and supports the operational environment. A well-known model is the
Waterfall model.

CLDS (SDLC spelt in reverse) is the methodology followed for developing a
data mining application. It runs the reverse way because the end user, being
a manager and not a technocrat, does not at the outset realise the potential
of the system and its decision support capability. The user expects the
technical experts to present the available data, identify suitable
algorithms and test the results.

[Figure: SDLC stages: Feasibility, Analysis, Design, Coding, Testing,
Integration, Implementation. CLDS stages: Implement warehouse, Integrate
data, Test for information bias, Develop program for data, Design DSS
system, Analyze result, Understand requirements]

1.6.3 Selection of Hardware of Data Warehouse

Decision to select hardware is based on:

(a) Existing & Future Business growth plan of the organisation

(b) Data expected to be stored.

(c) Performance expected from system.

(d) Identifying suitable SMP or MPP system

(e) Network Bandwidth for external connectivity

(f) Nature of legacy systems and their interfaces.

(g) Nature of Information Delivery System

(h) Data & Application Servers

(i) Client Desk top system with adequate storage & processing power

1.6.4 Selection or Development of Data Mining Tools

(a) Identify the User requirement.

(b) Categorise the model applicable and the associated tasks.

(c) Make or Buy suitable tools

(d) Find out the available data warehouse capabilities and develop an

application based on these.

1.7 Data Mining & Warehousing Challenges

1.7.1 Major issues/challenges to Data mining

These are:

1. Mining different kinds of knowledge in databases.

2. Interactive mining of knowledge at multiple levels of abstraction.

3. Incorporation of background knowledge.

4. Data mining Query languages and ad hoc data mining.

5. Presentation and visualisation of data mining results.

6. Handling noisy or incomplete data.Self-Instructional Material 25

[Figure, continued: remaining CLDS stages: Develop program for data, Design DSS system, Analyse result, Understand requirements.]





1.7.2 Data Warehouse- Open issues & research problems

The Data Warehouse, being an evolving system, has many issues which are
open for further study & research. These are in the areas of:

1. Security: The data warehouse contains full details of organisational
information, some of which is highly confidential, and many employees
are given access to its contents. Special efforts are required to
maintain the safety, security and confidentiality of the valuable and
sensitive information residing in it.

2. Performance: Very large databases have to be stored on multiprocessor
systems and not on conventional computers. These systems fall under the
categories of Massively Parallel Processors (MPP) and Symmetric
Multiprocessors (SMP). Both are upgradable and permit a very large
database to be partitioned and accessed at high speed. Features like
shared memory protection and dynamic load balancing are built into them.
Designing scalable systems is an important factor.

3. Functionality: Modifying access methods is a major challenge for
data warehouse managers. Developing novel data mining algorithms and
OLAP tools to meet users' latest requirements is essential for
extracting interesting data from the warehouse.

4. Presentation: Data visualisation techniques have to be constantly
improved so that end users can utilise the full capability of the system,
draw inferences and analyse the results.
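
The partitioning mentioned under Performance can be sketched as follows. This is an illustration only: rows are assigned to partitions by hashing a key, each partition is aggregated separately, and the partial results are merged. The function names are hypothetical, and the thread pool merely mirrors the structure of parallel processing; it is not a model of real SMP or MPP hardware.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, key, n_partitions):
    """Assign each row to a partition using a stable hash of its key value."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[idx].append(row)
    return partitions


def partial_totals(rows, key, measure):
    """Sum the measure per key value within one partition."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def parallel_totals(rows, key, measure, n_partitions=4):
    """Aggregate every partition in its own worker and merge the results."""
    partitions = partition_rows(rows, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(lambda part: partial_totals(part, key, measure),
                                 partitions))
    merged = {}
    for partial in partials:
        merged.update(partial)  # a key lives in exactly one partition
    return merged


if __name__ == "__main__":
    # Hypothetical sales rows.
    sales = [
        {"region": "north", "amount": 10.0},
        {"region": "south", "amount": 5.0},
        {"region": "north", "amount": 20.0},
    ]
    print(parallel_totals(sales, "region", "amount"))
```

Because all rows with the same key land in the same partition, the partial results never overlap and merging them is trivial. This is the property that makes a partitioned design scale as more processors are added.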

1.8 Impact of Data Mining & Warehousing

1. Collecting, cleaning and transforming data from heterogeneous sources.

2. Integrating data from dissimilar sources. Examples of heterogeneous
& dissimilar sources are legacy systems, the World Wide Web and
organisations' ERPs.

3. Storing data in a systematic yet easily accessible manner.
Examples are creating appropriate Fact & relation tables, organising
data marts and partitioning of the database.

4. Organising data to facilitate manipulation and processing of very large
data. A good metadata design is a prime factor in attaining this
objective.

5. Simplifying exploration of massive data. Creation of data marts is
essential for meeting this goal.

6. Arranging data for interpretation & pattern recognition. Efficient data
mining algorithms are necessary to achieve it.





7. Presenting data in a readable manner for the managers by designing
a proper GUI.

8. Business decision making based on data mining & other access tools.
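
Points 1 and 2 (collecting, cleaning, transforming and integrating data from dissimilar sources) can be sketched in a few lines. All sample records, field names and function names below are hypothetical; the sketch assumes a legacy system that exports CSV text and an ERP that supplies records as dictionaries, and it merges them on a shared customer identifier.

```python
import csv
import io

# Hypothetical extracts from two dissimilar sources.
LEGACY_CSV = "CUST_ID,NAME,CITY\n 101 ,asha rao,pune\n102,  Ravi Kumar ,\n"
ERP_RECORDS = [
    {"customer_id": 102, "full_name": "Ravi Kumar", "city": "Mumbai"},
    {"customer_id": 103, "full_name": "Meena Iyer", "city": "Chennai"},
]


def from_legacy(text):
    """Clean and transform legacy CSV rows into the common record layout."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["CUST_ID"].strip()),
            "name": row["NAME"].strip().title(),
            "city": row["CITY"].strip().title() or None,
        }


def from_erp(records):
    """Transform ERP records into the same common layout."""
    for rec in records:
        yield {
            "customer_id": int(rec["customer_id"]),
            "name": rec["full_name"].strip().title(),
            "city": (rec.get("city") or "").strip().title() or None,
        }


def integrate(*sources):
    """Merge records on customer_id; later sources fill gaps left by earlier ones."""
    merged = {}
    for source in sources:
        for rec in source:
            current = merged.setdefault(rec["customer_id"], {})
            for field, value in rec.items():
                if value is not None:
                    current[field] = value
                else:
                    current.setdefault(field, None)
    return [merged[k] for k in sorted(merged)]


if __name__ == "__main__":
    for customer in integrate(from_legacy(LEGACY_CSV), from_erp(ERP_RECORDS)):
        print(customer)
```

In this sketch customer 102 has no city in the legacy extract, so the value is taken from the ERP record. A real warehouse load would also need rules for resolving conflicting values, which are omitted here.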

1.9 Summary- Overview of Data Mining & Warehousing

1. Data Mining & Data Warehousing principles were identified during the early 1990s
but were implemented only after a decade due to non-availability of:

(a) Suitable hardware at an affordable cost, to support parallel processing, provide
fast speeds and store massive databases.

(b) Operating systems to support parallel architectures.

(c) DBMS to manage very large database.

(d) Network bandwidth for interconnectivity.

(e) Suitable data mining algorithms.

(f) Visualisation & presentation tools.

2. The Data Warehouse provides a platform for capturing, refining, integrating and
transforming data received from diverse sources. The data is then stored in
subject-wise form in a central repository accessed through a metadata mechanism.

3. The data is stored in a relational form by creating Fact & Dimension tables
connected through specialised schemas based on concept hierarchies. These are the Star,
Snowflake & Fact Constellation schemas.

4. The data is further organised as MRDB & MDDB which may also be considered

as part of Data marts created for different users to meet their specific requirements.

5. For producing useful reports, Data Mining, OLAP & Query tools and application
tools are utilised.

6. Any result requires an effective presentation. GUI and visualisation tools are made
available to the managers so that they can assimilate and analyse the results for speedy
decision making.

7. Data mining & warehousing is constantly evolving and has to adopt a framework which
is flexible and can absorb organisational and technological changes.
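
Point 3 of the summary (Fact & Dimension tables connected in a Star schema) can be illustrated with a very small example. This is a sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the table names, columns and sample rows are hypothetical.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table with two dimension tables and load sample rows."""
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            time_id INTEGER REFERENCES dim_time(time_id),
            amount REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(1, "Books"), (2, "Toys")])
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                     [(1, 2018, "Q1"), (2, 2018, "Q2")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0)])


def sales_by_category(conn):
    """Join the fact table to a dimension and total the measure per category."""
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    print(sales_by_category(connection))
```

The fact table holds only keys and measures, while the descriptive attributes sit in the dimension tables around it. A Snowflake schema would further split a dimension such as dim_product into additional normalised tables.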

1.10 Answer to ‘Check your Progress’

Question 1: In what way do Data Warehouse and Data Mining complement each other?

Question 2: Data Warehouse & Data Mining principles were formulated in the early
1990s but implemented only after a decade. Why?

Question 3: Differentiate between Operational Data Base, Operational Data
Store & Data Warehouse Repository.

Question 4: What is the function of “Information Delivery System”?






Question 5: What are the differences between Data Mining, OLAP & Query Tools?

1.11 Exercises and Questions

1.11.1 Short-Answer Type Questions

Q1: Explain how the evolution of Database technology led to Data
Mining.

Q2: Describe the steps involved in Data mining when viewed as a process

of knowledge discovery?

Q3: What are Data Mining models and the tasks associated with them?

Q4: What are the Data Mining issues? In what way do they affect
implementation of a Data Mining system?

Q5: What is meant by interestingness of data as related to data mining?

1.11.2 Long-Answer type Questions

Q1: Define the data mining functionalities.

Q2: What is the difference between:

(a) Discrimination & Classification

(b) Characterization & Clustering

(c) Classification & Prediction

(d) ROLAP & MOLAP

(e) MDDB & MRDB

(f) Data Warehouse & Data Mart.

Q3: Describe three challenges to data mining regarding data mining
methodology and user interaction issues.

Q4: Describe two challenges to data mining regarding performance
issues.

Tip: Performance issues are:

(a) Efficiency & scalability of data mining algorithms.

(b) Parallel, distributed and incremental mining algorithms.

Q5: Describe issues related to diversity of database types.

Tip: (a) Handling of relational and complex types of data.

(b) Mining information from heterogeneous databases and
global information systems.

1.12 Further Reading






1. Data Mining: Concepts and Techniques - Jiawei Han and Micheline
Kamber, Second Edition, Elsevier, 2006

2. Data Warehousing, Data Mining & OLAP - Alex Berson and Stephen J.
Smith, Tata McGraw Hill, 2004

3. Building the Data Warehouse - W.H. Inmon, Third Edition,
Wiley, 2009

