A Kit For Knowledge Discovery

Page 1: A Kit For Knowledge Discovery

A Kit For Knowledge Discovery

Page 2: A Kit For Knowledge Discovery

Data, data everywhere, yet ...

• I can’t find the data I need: data is scattered over the network; many versions, subtle differences

• I can’t get the data I need: need an expert to get the data

• I can’t understand the data I found: available data is poorly documented

• I can’t use the data I found: results are unexpected; data needs to be transformed from one form to another

Page 3: A Kit For Knowledge Discovery

• There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data.

• Achieving a standardized process model

Page 4: A Kit For Knowledge Discovery

What is KDD?

Knowledge Discovery in Data is the significant method of evaluating

1. Legitimate, innovative

2. Probably useful

3. Accurate, understandable

patterns in data.

Page 5: A Kit For Knowledge Discovery

Knowledge Discovery Process

Figure: Data Warehouse → Integration → Raw Data → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge (Understanding)

Page 6: A Kit For Knowledge Discovery

Outcomes of Data Mining

Forecasting Future

Clustering Based On Attributes

Events Correlation – Association

Classification on Recognizing patterns

Sequencing Events ~ Later Predictions

Page 7: A Kit For Knowledge Discovery

Data Mining

Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data

Page 8: A Kit For Knowledge Discovery

Data Mining

Data + Interestingness criteria = Hidden patterns

Page 9: A Kit For Knowledge Discovery

Data Mining

Data + Interestingness criteria = Hidden patterns

Type of patterns

Page 10: A Kit For Knowledge Discovery

Data Mining

Data + Interestingness criteria = Hidden patterns

Type of data, type of interestingness criteria

Page 11: A Kit For Knowledge Discovery

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

Page 12: A Kit For Knowledge Discovery

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

Data → Information

Page 13: A Kit For Knowledge Discovery

Data Mining Process

1. Problem Definition

2. Data Integration & Cleaning

3. Model Framing & Evaluation

4. Knowledge Discovery

Page 14: A Kit For Knowledge Discovery

Basic Operations in DM

Data Mining Task

Predictive:

• Regression

• Classification

• Collaborative Filtering

Descriptive:

• Clustering / Similarity Matching

• Association rules

• Deviation detection

Page 15: A Kit For Knowledge Discovery

Why Machine Learning

Growing flood of online data

Budding industry

Progress in algorithms and theory

• Data mining: using historical data to improve decisions

– medical records ⇒ medical knowledge

– log data to model user

• Software applications we can’t program by hand

– autonomous driving

– speech recognition

• Self-customizing programs

– Newsreader that learns user interests

Page 16: A Kit For Knowledge Discovery

Machine Learning

Machine Learning / Data Mining

Supervised: Discover patterns in the data. Presence of a target attribute.

Unsupervised: Data have no target attribute. Explore data to find patterns.

Page 17: A Kit For Knowledge Discovery

Applications Of Data Mining

Page 18: A Kit For Knowledge Discovery

Applications of Data Mining

Fraud/Non-Compliance Anomaly detection

• Isolate the factors that lead to fraud, waste and abuse

• Target auditing and investigative efforts more effectively

Credit/Risk Scoring

Intrusion detection

Recruiting/Attracting customers

Maximizing profitability (cross selling, identifying profitable customers)

Service Delivery and Customer Retention

• Build profiles of which customers are likely to use which services

Page 19: A Kit For Knowledge Discovery

Tools For Data Mining

LinkOut NCBI Sequin Rapid Miner LibSvm ADaM

etc….

Page 20: A Kit For Knowledge Discovery

Why Weka

Weka is a collection of machine learning algorithms for data mining tasks.

The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

It is also well-suited for developing new machine learning schemes.

Page 21: A Kit For Knowledge Discovery

About WEKA

Waikato Environment for Knowledge Analysis (WEKA)

Developed by the Department of Computer Science, University of Waikato, New Zealand

Machine learning/data mining software coded in Java

Used for research, education, and applications

Exclusively for KDD.

Various versions are available, such as Version 2.3 (1998), Version 3.0 (1999), Version 3.4 (2003), and Version 3.6 (2008).

Page 22: A Kit For Knowledge Discovery

Weka GUI Chooser

Page 23: A Kit For Knowledge Discovery

A Vital Part In Weka

Explorer

Page 24: A Kit For Knowledge Discovery

Weka!

Weka is a collection of machine learning algorithms for data mining tasks.

The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

Perfectly suited for developing new machine learning schemes.

Page 25: A Kit For Knowledge Discovery

Weka’s Structural Layout

Explorer: An environment for exploring data with WEKA

Experimenter: Performing experiments and conducting statistical tests between learning schemes

Knowledge Flow: Supports the same functions as the Explorer but with drag-and-drop

Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands

Page 26: A Kit For Knowledge Discovery

Algorithms

Page 27: A Kit For Knowledge Discovery

WEKA File: Attribute-Relation File Format (ARFF)

WEKA stores data in flat files (ARFF format).

It is easy to convert an Excel file to ARFF format.

An ARFF file consists of a list of instances.

An ARFF file can be created using Notepad or Word (saved as plain text).

The name of the dataset is given with @relation.

Attribute information is given with @attribute.

The data follows @data.

Page 28: A Kit For Knowledge Discovery

Sample ARFF
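
The sample on this slide was an image and did not survive extraction. As a stand-in, here is a minimal sketch of an ARFF file and of reading its three sections (@relation, @attribute, @data) in plain Python. The weather-style relation, its attribute names and values, and the parse_arff helper are all assumptions for illustration, not taken from the slide. The parser handles only this simple subset of ARFF (no quoted values, sparse rows, or date attributes).

```python
# Hypothetical ARFF content; the relation and attributes are made up.
SAMPLE_ARFF = """\
% Hypothetical example data set
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""


def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            # Everything after @data is one comma-separated instance per line
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation)
print(attributes)
print(rows)
```

All values are kept as strings here; a fuller reader would convert numeric attributes and treat "?" as a missing value.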

Page 29: A Kit For Knowledge Discovery

Intrinsic Operations

1. Preprocess

2. Classify

3. Cluster

4. Associate

5. Select Attributes

Page 30: A Kit For Knowledge Discovery
Page 31: A Kit For Knowledge Discovery

Preprocessing

Changing data formats as needed.

The steps vary with the dataset being mined.

Some of the preprocessing steps:

Adding/removing attributes

Attribute value substitution

Discretization (MDL, Kononenko, etc.)

Time series filters (delta, shift)

Sampling, randomization

Missing value management

Normalization and other numeric transformations
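
Three of the steps listed above can be sketched in a few lines of plain Python: missing value management, normalization, and discretization. This is an illustration under assumptions, not Weka's implementation. The function names and the sample "ages" column are hypothetical, and equal-width binning is used as a simpler unsupervised stand-in for the supervised MDL-based discretization the slide mentions.

```python
def replace_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, bins):
    """Unsupervised discretization: map each value to a bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    # The maximum value would land in bin `bins`, so clamp it to the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [20, None, 40, 60]  # hypothetical numeric attribute with a gap
filled = replace_missing_with_mean(ages)
scaled = min_max_normalize(filled)
binned = equal_width_bins(filled, 2)
print(filled, scaled, binned)
```

Imputing before normalizing matters: the missing entry has to be filled before the minimum and maximum of the column are computed.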

Page 32: A Kit For Knowledge Discovery

Algorithms

Page 33: A Kit For Knowledge Discovery

Pre-Processing

Browse for the data file in the local file system.

Relations, Instances, Schema, Attributes, Filters

Opening Files, Current Relation, Operations

Page 34: A Kit For Knowledge Discovery

Weka – Formulating Files

Page 35: A Kit For Knowledge Discovery

Dataset - .txt Format

Page 36: A Kit For Knowledge Discovery

Weka ~ Datasets

Page 37: A Kit For Knowledge Discovery

Missing Values

Page 38: A Kit For Knowledge Discovery

GenericObjectEditor

A property editor for objects that are defined as editable in the GenericObjectEditor configuration file, which lists possible values that can be selected from, and themselves configured.

The configuration file is called "GenericObjectEditor.props" and may live in either the location given by "user.home" or the current directory (this last will take precedence), and a default properties file is read from the weka distribution.

Page 39: A Kit For Knowledge Discovery

Weka ~ GenericObjectEditor

This editor allows you to configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.

Page 40: A Kit For Knowledge Discovery

Sample - Cluster

Attributes for Cluster

Page 41: A Kit For Knowledge Discovery

Weka’s Viewer

Page 42: A Kit For Knowledge Discovery

PCA Analysis

Page 43: A Kit For Knowledge Discovery

Pre-Processing Retrievals

Before / After

Page 44: A Kit For Knowledge Discovery

Retrieving Significant Attributes

Page 45: A Kit For Knowledge Discovery
Page 46: A Kit For Knowledge Discovery

Algorithms

Page 47: A Kit For Knowledge Discovery

Feature Selection

Some columns are noisy or redundant. This noise makes it more difficult to discover meaningful patterns from the data.

To discover quality patterns, most data mining algorithms require a much larger training data set when the data is high-dimensional.

Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models.

Page 48: A Kit For Knowledge Discovery

Attribute Selection

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction.

To do this, two objects must be set up:

The evaluator determines what method is used to assign a worth to each subset of attributes.

The search method determines what style of search is performed.

The Attribute Selection Mode box has two options:

1. Use full training set.

2. Cross-validation.

Page 49: A Kit For Knowledge Discovery

Attribute Selection

Very flexible: arbitrary combination of search and evaluation methods

Both filtering and wrapping methods

Search methods: best-first, genetic, ranking, ...

Evaluation measures: Relief, information gain, gain ratio, ...
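
To make the evaluator and search pairing concrete, the sketch below combines one evaluation measure from the list (information gain) with the simplest search style (ranking each attribute on its own). It is a plain-Python illustration, not Weka's code. The tiny data set, the attribute names, and the function names are hypothetical, and all attributes are assumed to be nominal.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index, class_index=-1):
    """Class entropy minus the weighted entropy after splitting on one attribute."""
    labels = [row[class_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[class_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical nominal data: (outlook, humidity, class)
rows = [
    ("sunny", "high", "no"),
    ("sunny", "normal", "no"),
    ("rainy", "high", "yes"),
    ("rainy", "normal", "yes"),
]
names = ["outlook", "humidity"]

# Ranking search: score every attribute independently, best first
ranking = sorted(
    ((information_gain(rows, i), name) for i, name in enumerate(names)),
    reverse=True,
)
for gain, name in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy data "outlook" determines the class completely and "humidity" carries no information, so the ranking separates them cleanly. Subset searches such as best-first would instead score combinations of attributes.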

Page 50: A Kit For Knowledge Discovery

Applying Algorithm

Page 51: A Kit For Knowledge Discovery

Best Attribute

Page 52: A Kit For Knowledge Discovery

Algorithm……

Page 53: A Kit For Knowledge Discovery
Page 54: A Kit For Knowledge Discovery

Classification

Classification is a data mining function that assigns items in a collection to target categories or classes.

The goal of classification is to accurately predict the target class for each case in the data.

A classification task begins with a data set in which the class assignments are known.

For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time.

Page 55: A Kit For Knowledge Discovery

Classification ~ Naive Bayes classifier

A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable.

For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.

Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
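
The independence assumption can be seen directly in code: the score for a class is the class prior multiplied by one per-feature probability at a time. The following is a minimal sketch for nominal features with Laplace (add-one) smoothing, which is an added assumption rather than something the slide states. The fruit data and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """rows: list of (feature_tuple, label). Returns the count tables."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, features):
    """Return (best label, unnormalized score per label)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # Laplace-smoothed estimate of P(value | class)
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training data: (color, shape) -> fruit
training = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("yellow", "long"), "banana"),
    (("green", "long"), "banana"),
]
model = train_naive_bayes(training)
label, scores = predict(model, ("red", "round"))
print(label, scores)
```

The scores are proportional to the posterior probabilities; dividing each by their sum would give normalized probabilities. Real implementations usually sum log-probabilities instead of multiplying, to avoid underflow with many features.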

Page 56: A Kit For Knowledge Discovery

Naive Bayes Classifier

Page 57: A Kit For Knowledge Discovery

Confusion Matrix – Pervasive Role

Page 58: A Kit For Knowledge Discovery

Confusion Matrix - Dataset

Page 59: A Kit For Knowledge Discovery

Second Fold - Classification

Page 60: A Kit For Knowledge Discovery
Page 61: A Kit For Knowledge Discovery

Algorithms

Page 62: A Kit For Knowledge Discovery

Clustering

Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

Clustering belongs to unsupervised learning.
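
As one concrete instance of this definition, here is a minimal k-means sketch in plain Python, with "similar" taken to mean a small squared Euclidean distance. The points, the choice of initial centroids, and the function names are hypothetical. Initial centroids are passed in explicitly so the run is deterministic, whereas practical implementations usually pick them at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No class labels are used anywhere, which is what makes this unsupervised: the grouping comes only from the distances between points.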

Page 63: A Kit For Knowledge Discovery

Example ~ Weka

Page 64: A Kit For Knowledge Discovery

Attributes Replacements

Page 65: A Kit For Knowledge Discovery

Updates

Page 66: A Kit For Knowledge Discovery

K-Means

Page 67: A Kit For Knowledge Discovery

Visualizer

Open Saved File

Save File => Stored in ARFF format

Page 68: A Kit For Knowledge Discovery

Visualizer – Samples

Page 69: A Kit For Knowledge Discovery
Page 70: A Kit For Knowledge Discovery

Association rules

Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository.

Association rule mining finds frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases.

An example of an association rule would be "If a customer buys a dozen eggs, he is 90% likely to also purchase milk."

Market Basket Analysis
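
The "likely to also purchase" figure in a rule like the eggs and milk example is its confidence, usually reported alongside its support. Below is a minimal sketch of both measures in plain Python. The baskets are hypothetical and do not reproduce the 90% figure, and the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets
baskets = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk", "butter"},
    {"eggs", "bread"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"eggs", "milk"})
rule_confidence = confidence(baskets, {"eggs"}, {"milk"})
print(f"eggs => milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Rule miners such as Apriori first find itemsets whose support exceeds a threshold and then keep only the rules whose confidence is high enough; this sketch shows just the two measures, not the search.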

Page 71: A Kit For Knowledge Discovery

Association

Page 72: A Kit For Knowledge Discovery

Description

Page 73: A Kit For Knowledge Discovery

Rules Framing

Rules Set

Page 74: A Kit For Knowledge Discovery

Visualize

Page 75: A Kit For Knowledge Discovery

Result Analysis

Weka

Result 2

Result 1

Concept

Page 76: A Kit For Knowledge Discovery