A Kit For Knowledge Discovery
Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
What is KDD?
• There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data.
• Achieving a standardized process model
Knowledge Discovery in Data is the significant method of evaluating patterns in data that are:
1. Legitimate and innovative
2. Probably useful
3. Accurate and understandable
Knowledge Discovery Process
[Figure: Knowledge Discovery Process. Data stages: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Steps: Understanding, Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation.]
Outcomes of Data Mining
Forecasting Future
Clustering Based On Attributes
Events Correlation – Association
Classification on Recognizing patterns
Sequencing Events ~ Later Predictions
Data Mining
Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
Data Mining
Data + Interestingness criteria = Hidden patterns
Type of data + Type of interestingness criteria = Type of patterns
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
Data → Information
Data Mining Process
1. Problem Definition
2. Data Integration & Cleaning
3. Model Framing & Evaluation
4. Knowledge Discovery
Basic Operations in DM
Predictive:
Regression
Classification
Collaborative Filtering
Data Mining Task
Descriptive:
Clustering / Similarity Matching
Association rules
Deviation detection
Why Machine Learning
Growing flood of online data
Budding industry
Progress in algorithms and theory
• Data mining: using historical data to improve decisions
– medical records ⇒ medical knowledge
– log data to model user
• Software applications we can’t program by hand
– autonomous driving
– speech recognition
• Self-customizing programs
– Newsreader that learns user interests
Machine Learning
Unsupervised: Data have no target attribute. Explore the data to find patterns.
Supervised: Discover patterns in the data in the presence of a target attribute.
Applications Of Data Mining
Fraud/Non-Compliance Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
Build profiles of customers likely to use which services
Tools For Data Mining
LinkOut NCBI Sequin Rapid Miner LibSvm ADaM etc.
Why Weka
Weka is a collection of machine learning algorithms for data
mining tasks.
The algorithms can either be applied directly to a dataset or
called from your own Java code.
Weka contains tools for data pre-processing, classification,
regression, clustering, association rules, and visualization.
It is also well-suited for developing new machine learning
schemes.
About WEKA
Waikato Environment for Knowledge Analysis (WEKA)
Developed by the Department of Computer Science, University of Waikato,
New Zealand
Machine learning/data mining software coded in Java
Used for research, education, and applications
Exclusively for KDD.
Various Versions are available such as Version 2.3, 1998; Version 3.0, 1999;
Version 3.4, 2003; Version 3.6, 2008.
Weka GUI Chooser
A Vital Part In Weka
Explorer
Weka’s Structural Layout
Explorer: An environment for exploring data with WEKA
Knowledge Flow: Supports the same functions as the Explorer but with drag-and-drop
Experimenter: Performing experiments and conducting statistical tests between learning schemes
Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA algorithms
WEKA ! File
WEKA stores data in flat files (ARFF format).
Easy to transform EXCEL file to ARFF format.
ARFF file consists of a list of instances
ARFF file can be created using Notepad or Word.
The name of the dataset is declared with @relation.
Attribute information is declared with @attribute.
The data itself follows @data.
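As a minimal sketch of the layout just described, the Python snippet below holds a tiny ARFF text and reads it back. The relation name `weather`, its attributes, its rows, and the helper `parse_arff` are hypothetical illustrations and not part of WEKA. The parser only handles this simple comma-separated case.

```python
# Minimal ARFF sketch: the relation, attributes, and rows are made-up examples.
ARFF_TEXT = """\
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""


def parse_arff(text):
    """Parse a simple ARFF string into (relation, attribute names, rows).

    Hypothetical helper: ignores quoting, sparse data, and other ARFF features.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            # every line after @data is one instance
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes, len(rows))
```

Because the format is plain text, the same content could be typed into any text editor and saved with the .arff extension.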
Attribute-Relation File Format (ARFF)
Sample ARFF
Intrinsic Operations
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
Preprocessing
Changing data formats as needed.
The steps vary with the dataset being mined.
Some of the Preprocessing Steps
Adding/removing attributes
Attribute value substitution
Discretization (MDL, Kononenko, etc.)
Time series filters (delta, shift)
Sampling, randomization
Missing value management
Normalization and other numeric transformations
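Two of the steps listed above, missing value management and normalization, can be sketched in plain Python. This is only an illustration under stated assumptions: replacing a missing value with the column mean is one common choice, the `ages` column is made up, and the function names are hypothetical rather than WEKA filter names.

```python
def replace_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def normalize(values):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


ages = [20, None, 40, 60]        # made-up column with one missing value
filled = replace_missing(ages)   # the missing entry becomes the column mean
scaled = normalize(filled)       # smallest value maps to 0, largest to 1
print(filled, scaled)
```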
Algorithms
Pre-Processing
Browse for the datafile in local filesystem.
[Figure: Preprocess panel. Labels: Opening Files, Current Relation, Operations; Relations, Instances, Schema, Attributes, Filters.]
Weka – Formulating Files
Dataset -.txt Format
Weka ~ Dataset’s
Missing Values
GenericObjectEditor
A property editor for objects that are themselves defined as editable in the
GenericObjectEditor configuration file, which lists the possible
values that can be selected and then configured in turn.
The configuration file is called "GenericObjectEditor.props"
and may live either in the location given by "user.home" or in the
current directory (the latter takes precedence); a default
properties file is read from the Weka distribution.
Weka ~ GenericObjectEditor
This editor lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.
Sample - Cluster
Attributes for Cluster
Weka’s Viewer
PCA Analysis
Pre-Processing Retrievals
Before and After
Retrieving Significant Attributes
Algorithms
Feature Selection
Some columns are noisy or redundant. This noise makes it more difficult to
discover meaningful patterns from the data.
To discover quality patterns, most data mining algorithms require a much
larger training data set when the data is high-dimensional.
Feature selection, also known as variable selection, feature
reduction, attribute selection or variable subset selection,
is the technique of selecting a subset of relevant features for building
robust learning models.
Attribute Selection
Attribute selection involves searching through all possible combinations of
attributes in the data to find which subset of attributes works best for
prediction.
To do this, two objects must be set up: an evaluator and a search method.
The evaluator determines what method is used to assign a worth to each
subset of attributes.
The search method determines what style of search is performed.
The Attribute Selection Mode box has two options:
1. Use full training set.
2. Cross-validation.
Attribute Selection
Very flexible: arbitrary combination of search and evaluation methods
Both filtering and wrapping methods
Search methods: best-first, genetic, ranking, ...
Evaluation measures: Relief, information gain, gain ratio, ...
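To make the evaluator/search pairing concrete, here is a small sketch that uses information gain as the evaluation measure and a simple ranking as the search method. It is written in plain Python and is not WEKA's implementation. The toy rows and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(rows, target):
    """Ranking search: score each attribute on its own and sort best first."""
    names = [name for name in rows[0] if name != target]
    scores = {name: information_gain(rows, name, target) for name in names}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data: 'outlook' determines the class, 'windy' carries no information.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
ranking = rank_attributes(rows, "play")
print(ranking)
```

Ranking scores each attribute individually, so it is cheap but cannot detect attributes that are only useful in combination. Subset searches such as best-first address that at a higher cost.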
Applying Algorithm
Best Attribute
Algorithm……
Classification
Classification is a data mining function that assigns items in a collection to
target categories or classes.
The goal of classification is to accurately predict the target class for each
case in the data.
A classification task begins with a data set in which the class assignments
are known.
For example, a classification model that predicts credit risk could be
developed based on observed data for many loan applicants over a period of
time.
Classification ~ Naive Bayes classifier
A naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any
other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and
about 4" in diameter.
Even if these features depend on each other or upon the existence of the other
features, a naive Bayes classifier considers all of these properties to
independently contribute to the probability that this fruit is an apple.
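The independence assumption means the score for a class is simply its prior multiplied by one conditional probability per feature. The sketch below shows this for categorical features in plain Python. The fruit data, the add-one smoothing, and the function names are assumptions for illustration and are not taken from WEKA.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(index, label)][value] += 1
    return class_counts, value_counts


def predict(model, row, values_per_feature):
    """Return the class with the highest naive Bayes score for row."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for index, value in enumerate(row):
            # P(value | class) with add-one (Laplace) smoothing
            seen = value_counts[(index, label)][value]
            score *= (seen + 1) / (count + values_per_feature[index])
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data: each row is (color, shape).
rows = [
    ("red", "round"), ("red", "round"), ("green", "round"),
    ("yellow", "long"), ("yellow", "long"), ("green", "long"),
]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]
values_per_feature = [3, 2]  # 3 distinct colors, 2 distinct shapes

model = train(rows, labels)
print(predict(model, ("red", "round"), values_per_feature))
print(predict(model, ("yellow", "long"), values_per_feature))
```

Each feature contributes its own factor to the product, regardless of how strongly the features depend on one another in reality.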
Naive Bayes Classifier
Confusion Matrix –Pervasive Role
Confusion Matrix - Dataset
Second Fold -Classification
Algorithms
Clustering
Clustering is the task of assigning a set of objects into groups
(called clusters) so that the objects in the same cluster are more similar (in
some sense or another) to each other than to those in other clusters.
Clustering belongs to unsupervised learning.
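One widely used clustering method is k-means, which these slides also name. A minimal sketch in plain Python follows, assuming squared Euclidean distance as the similarity measure and starting centroids supplied by the caller. The points and function names are made up for illustration.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances))


def k_means(points, centroids, iterations=10):
    """Plain k-means from given starting centroids; returns (assignments, centroids)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # assignment step: attach every point to its nearest centroid
        assignments = [closest(p, centroids) for p in points]
        # update step: move each centroid to the mean of its members
        for index in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == index]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids


# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
assignments, centroids = k_means(points, centroids=[points[0], points[3]])
print(assignments, centroids)
```

No class labels are used anywhere: the groups emerge only from the distances between points.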
Example ~ Weka
Attributes Replacements
Updations
K- Means
Visualizer
Open Saved File
Save File =>Will Store in ARFF
Visualizer – Samples
Association rules
Association rules are if/then statements that help uncover relationships
between seemingly unrelated data in a relational database or other
information repository.
Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases.
An example of an association rule would be "If a customer buys a dozen
eggs, he is 90% likely to also purchase milk."
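The "90% likely" in a rule like this is usually called its confidence, and how often the items occur at all is its support. The sketch below computes both in plain Python. The baskets are made-up data chosen to mirror the eggs and milk example, and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up baskets: 10 contain eggs, and 9 of those also contain milk.
baskets = [{"eggs", "milk"}] * 9 + [{"eggs"}] + [{"bread"}] * 10

egg_support = support(baskets, {"eggs"})                   # share of baskets with eggs
rule_confidence = confidence(baskets, {"eggs"}, {"milk"})  # share of egg baskets with milk
print(egg_support, rule_confidence)
```

Rule mining typically keeps only rules whose support and confidence both exceed chosen thresholds, which is one form of the interestingness criteria mentioned earlier.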
Market Basket Analysis
Association
Description
Rules Framing
Rules Set
Visualize
Result Analysis
[Figure: Result Analysis. Labels: Concept, Weka, Result 1, Result 2.]