A Kit For Knowledge Discovery


Data, Data everywhere yet ...

• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences

• I can’t get the data I need
– need an expert to get the data

• I can’t understand the data I found
– available data poorly documented

• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another

What is KDD?

Knowledge Discovery in Data is the significant method of evaluating patterns in data that are:

1. Legitimate and innovative
2. Probably useful
3. Accurate and understandable

• There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data.

• Achieving a standardized process model

Knowledge Discovery Process

[Figure: the knowledge discovery process. Stages: Data Warehouse, Raw Data, Target Data, Transformed Data, Patterns and Rules, Knowledge. Steps: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation, Understanding.]

Outcomes of Data Mining

Forecasting Future

Clustering Based On Attributes

Events Correlation – Association

Classification on Recognizing patterns

Sequencing Events ~ Later Predictions

Data Mining

Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data

Data + Interestingness criteria = Hidden patterns

Type of data + Type of interestingness criteria = Type of patterns

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

Data ⇒ Information

Data Mining Process

1. Problem Definition

2. Data Integration & Cleaning

3. Model Framing & Evaluation

4. Knowledge Discovery


Basic Operations in DM

Data Mining Task

Predictive:
– Regression
– Classification
– Collaborative Filtering

Descriptive:
– Clustering / Similarity Matching
– Association rules
– Deviation detection

Why Machine Learning

Growing flood of online data

Budding industry

Progress in algorithms and theory

• Data mining: using historical data to improve decisions
– medical records ⇒ medical knowledge
– log data to model user

• Software applications we can’t program by hand
– autonomous driving
– speech recognition

• Self customizing programs
– Newsreader that learns user interests

Machine Learning

Unsupervised: data have no target attribute. Explore the data to find patterns.

Supervised: a target attribute is present. Discover patterns in the data.

Applications Of Data Mining

Fraud/Non-Compliance Anomaly detection

Isolate the factors that lead to fraud, waste and abuse

Target auditing and investigative efforts more effectively

Credit/Risk Scoring

Intrusion detection

Recruiting/Attracting customers

Maximizing profitability (cross selling, identifying profitable customers)

Service Delivery and Customer Retention

Build profiles of customers likely to use which services

Tools For Data Mining

LinkOut NCBI Sequin Rapid Miner LibSvm ADaM

etc….

Why Weka

Weka is a collection of machine learning algorithms for data mining tasks.

The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

It is also well-suited for developing new machine learning schemes.

About WEKA

Waikato Environment for Knowledge Analysis (WEKA)

Developed by the Department of Computer Science, University of Waikato, New Zealand

Machine learning/data mining software coded in Java

Used for research, education, and applications

Exclusively for KDD.

Various versions are available, such as Version 2.3 (1998), Version 3.0 (1999), Version 3.4 (2003) and Version 3.6 (2008).

Weka GUI Chooser

A Vital Part In Weka


Explorer


Weka’s Structural Layout

Explorer: an environment for exploring data with WEKA

Knowledge Flow: supports the same functions as the Explorer but with drag-and-drop

Experimenter: performing experiments and conducting statistical tests between learning schemes

Simple CLI: provides a simple command-line interface that allows direct execution of WEKA algorithms


WEKA File

WEKA stores data in flat files (ARFF format).

Easy to transform EXCEL file to ARFF format.

ARFF file consists of a list of instances

ARFF file can be created using Notepad or Word.

The name of the dataset is declared with @relation.

Attribute information is declared with @attribute.

The data follows @data.

Attribute-Relation File Format (ARFF)

Sample ARFF
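The layout described above (@relation, @attribute, @data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names to_arff and parse_arff and the toy "weather" dataset are invented here, and the parser handles only the simplest subset of ARFF (no quoted values, no sparse rows). It is not Weka's own reader or writer.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, type) pairs, where type is "numeric" or a
    list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"   # nominal values go in braces
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


def parse_arff(text):
    """Parse the simple ARFF subset produced by to_arff."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # blank line or ARFF comment
            continue
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))        # one instance per line
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical toy dataset, invented for illustration.
ARFF_TEXT = to_arff(
    "weather",
    [("outlook", ["sunny", "overcast", "rainy"]),
     ("temperature", "numeric"),
     ("play", ["yes", "no"])],
    [("sunny", 85, "no"), ("overcast", 83, "yes"), ("rainy", 70, "yes")],
)
```

Printing ARFF_TEXT shows the header lines first and then one comma-separated instance per line after @data, which is the same structure a spreadsheet export would need to be turned into.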

Intrinsic Operations

1. Preprocess

2. Classify

3. Cluster

4. Associate

5. Select Attributes

Preprocessing

Changing data formats as needed.

The steps required vary with the dataset being mined.

Some of the Preprocessing Steps

Adding/removing attributes

Attribute value substitution

Discretization (MDL, Kononenko, etc.)

Time series filters (delta, shift)

Sampling, randomization

Missing value management

Normalization and other numeric transformations
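Three of the steps listed above can be sketched in plain Python. This is a minimal sketch, not Weka's filters: the function names and the sample column are assumptions made for illustration, missing values are represented by None, and the discretization shown is simple equal-width binning rather than the MDL or Kononenko methods named in the list.

```python
def replace_missing(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def normalize(column):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def discretize(column, bins):
    """Equal-width binning: map each value to a bin index in 0..bins-1."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), bins - 1) for v in column]


# Hypothetical numeric attribute with one missing value (None).
ages = [20, None, 40, 60]
filled = replace_missing(ages)   # the gap is filled with the mean, 40.0
scaled = normalize(filled)       # values now lie between 0.0 and 1.0
binned = discretize(filled, 2)   # two equal-width bins over [20, 60]
```

The order matters in this sketch: missing values are filled first because min and max cannot be taken over a column that still contains None.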

Algorithms

Pre-Processing

Browse for the datafile in local filesystem.

Relations, Instances, Schema

Attributes, Filters

Opening Files, Current Relation, Operations

Weka – Formulating Files

Dataset -.txt Format

Weka ~ Dataset’s

Missing Values

GenericObjectEditor

A property editor for objects that have themselves been defined as editable in the GenericObjectEditor configuration file, which lists the possible values that can be selected and, in turn, configured.

The configuration file is called "GenericObjectEditor.props" and may live either in the location given by "user.home" or in the current directory (the latter takes precedence); a default properties file is read from the Weka distribution.

Weka ~ GenericObjectEditor

This editor lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.

Sample - Cluster

Attributes for Cluster

Weka’s Viewer

PCA Analysis

Pre-Processing Retrievals

Before / After

Retrieving Significant Attributes

Algorithms

Feature Selection

Some columns are noisy or redundant. This noise makes it more difficult to discover meaningful patterns in the data.

To discover quality patterns, most data mining algorithms require a much larger training set when the data is high-dimensional.

Feature selection (also known as variable selection, feature reduction, attribute selection or variable subset selection) is the technique of selecting a subset of relevant features for building robust learning models.

Attribute Selection

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction.

To do this, two objects must be set up:

• The evaluator determines what method is used to assign a worth to each subset of attributes.

• The search method determines what style of search is performed.

The Attribute Selection Mode box has two options:

1. Use full training set.

2. Cross-validation.

Attribute Selection

Very flexible: arbitrary combination of search and evaluation methods

Both filtering and wrapping methods

Search methods: best-first, genetic, ranking, ...

Evaluation measures: Relief, information gain, gain ratio, ...
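One combination from the lists above, information gain as the evaluation measure with ranking as the search method, can be sketched as follows. This is an illustrative sketch only: the function names and the toy rows are assumptions, attributes are treated as nominal, and it scores each attribute on its own rather than searching over subsets.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, target):
    """Ranking search: score every attribute alone, best first."""
    names = [k for k in rows[0] if k != target]
    return sorted(names, key=lambda a: information_gain(rows, a, target),
                  reverse=True)


# Hypothetical toy data, invented for illustration.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
RANKING = rank_attributes(ROWS, "play")
```

In this toy data "outlook" separates the classes perfectly while "windy" carries no information about the class, so "outlook" is ranked first.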

Applying Algorithm

Best Attribute

Algorithm……

Classification

Classification is a data mining function that assigns items in a collection to target categories or classes.

The goal of classification is to accurately predict the target class for each case in the data.

A classification task begins with a data set in which the class assignments are known.

For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time.

Classification ~ Naive Bayes classifier

A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable.

For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.

Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
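The independence assumption can be made concrete with a small sketch for nominal attributes. The function names, the fruit data and the use of add-one (Laplace) smoothing are assumptions made for this illustration; this is not Weka's NaiveBayes implementation.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)     # (attribute, class) -> value counts
    values = defaultdict(set)               # attribute -> distinct values seen
    for r in rows:
        for attr, val in r.items():
            if attr != target:
                value_counts[(attr, r[target])][val] += 1
                values[attr].add(val)
    return class_counts, value_counts, values


def predict(model, item):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total                   # prior P(class)
        for attr, val in item.items():
            # P(value | class) with add-one smoothing. Each attribute is
            # multiplied in independently, which is the "naive" assumption.
            count = value_counts[(attr, cls)][val]
            score *= (count + 1) / (n + len(values[attr]))
        scores[cls] = score
    return max(scores, key=scores.get)


# Hypothetical training data, invented for illustration.
FRUIT = [
    {"color": "red", "shape": "round", "label": "apple"},
    {"color": "red", "shape": "round", "label": "apple"},
    {"color": "green", "shape": "round", "label": "apple"},
    {"color": "yellow", "shape": "long", "label": "banana"},
    {"color": "green", "shape": "long", "label": "banana"},
]
MODEL = train(FRUIT, "label")
```

With this data, predict(MODEL, {"color": "red", "shape": "round"}) returns "apple": the color and shape evidence are simply multiplied together, without checking whether red fruit also tends to be round.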

Naive Bayes Classifier

Confusion Matrix – Pervasive Role

Confusion Matrix - Dataset

Second Fold - Classification

Algorithms

Clustering

Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

Clustering belongs to unsupervised learning.
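One common way to form such groups is k-means. The sketch below is a minimal version for 2-D points under stated assumptions: similarity is squared Euclidean distance, the first k points seed the centroids (a simplification chosen to keep the run deterministic), and the sample points are invented. It is not Weka's SimpleKMeans.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old centroid if a cluster is empty
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return assignment, centroids


# Hypothetical points forming two visibly separate groups.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
ASSIGNMENT, CENTROIDS = kmeans(POINTS, 2)
```

No class labels are used anywhere in the procedure, which is what places it under unsupervised learning.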

Example ~ Weka

Attributes Replacements

Updations

K-Means

Visualizer

Open Saved File

Save File =>Will Store in ARFF

Visualizer – Samples

Association rules

Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository.

Association rule mining finds frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases.

An example of an association rule would be "If a customer buys a dozen eggs, he is 90% likely to also purchase milk."
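A percentage like the one in the eggs and milk rule is usually called the rule's confidence, and it is computed from how often item sets occur together (their support). The sketch below shows both measures; the function names and the baskets are invented for illustration, and it assumes the antecedent occurs in at least one transaction.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "bread"},
    {"milk"},
    {"eggs", "milk", "butter"},
]
EGGS_TO_MILK = confidence(BASKETS, {"eggs"}, {"milk"})
```

Here eggs appear in four of the five baskets and eggs with milk in three, so the rule "eggs => milk" has a confidence of three in four.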

Market Basket Analysis

Association

Description

Rules Framing

Rules Set

Visualize

Result Analysis

Weka

Result 2

Result 1

Concept

Documents
