ashish-sakpal
7/31/2019 Dataware Housing and Mining 16-Mar-06
Data Mining: An Overview from a Database Perspective
Why Data Mining? Potential Applications
Marketing
Corporate Analysis
Fraud Detection
Other Applications
Marketing
Sales Analysis
associations between product sales
beer and diapers
Customer Profiling
data mining can tell you what types of customers buy what products
Identifying Customer Requirements
identify the best products for different customers
use prediction to find what factors will attract new customers
Corporate Analysis
Finances
cash flow analysis and prediction
Resources
summarize and compare the resources and spending
Competition
compare with other competitors by summarizing data to the same level
Fraud Detection
Auto Insurance Fraud
Association rule mining can detect a group of people who stage accidents to collect on insurance
Money Laundering
Since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions
Other Applications
Sports Teams
New York Knicks use data mining to gain a competitive advantage
Astronomy
California Institute of Technology and the Palomar Observatory discovered 22 quasars with the help of data mining
Banking
Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud
Data Mining: Major Issues
Diversity of data mining tasks:
Summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.
Diversity of data:
Relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.
Efficiency and scalability
Expression and visualization of data mining results
Data mining applications, social issues (security and
Data Mining: Classification
Different views, different classifications:
the kinds of knowledge to be mined
the kinds of database to be mined on
the kinds of techniques adopted
Knowledge to be mined:
Summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.
Database to be mined on:
Relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.
Techniques adopted:
Database statistics visualization machine learnin
Data Mining: A KD Process
Data mining: the core of the knowledge discovery process.
[Figure: knowledge discovery process. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
From OLAP to OLAP Mining
Construction of data warehouse and computation of data cubes.
OLAP: On-Line Analytical Processing.
OLAP operations: drilling/rolling, pivoting, slicing/dicing, filtering, etc.
OLAP mining (OLAM): Integration of OLAP with data mining.
On-line interactive mining: Mining intertwined with drilling, slicing and dicing, pivoting, etc.
Dynamic swapping of mining tasks.
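The OLAP operations named on this slide can be illustrated on a tiny in-memory fact table. This is only a sketch in plain Python: the fact table, the dimension names, and the helpers `roll_up`, `slice_cube`, and `dice` are hypothetical and stand in for what a real OLAP engine would do over a data cube.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, product, quarter, sales)
FACTS = [
    ("East", "Boston", "beer", "Q1", 100),
    ("East", "Boston", "diapers", "Q1", 80),
    ("East", "New York", "beer", "Q1", 150),
    ("West", "Seattle", "beer", "Q2", 120),
    ("West", "Seattle", "diapers", "Q2", 60),
]
DIMS = ("region", "city", "product", "quarter")


def roll_up(facts, keep):
    """Sum sales grouped by the dimensions in `keep` (roll-up)."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_cube(facts, dim, value):
    """Fix one dimension to a single value (slice)."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


def dice(facts, **allowed):
    """Restrict several dimensions to sets of allowed values (dice)."""
    return [
        row for row in facts
        if all(row[DIMS.index(d)] in vals for d, vals in allowed.items())
    ]


by_region = roll_up(FACTS, ["region"])         # coarse level
by_city = roll_up(FACTS, ["region", "city"])   # drill down one level
beer_only = slice_cube(FACTS, "product", "beer")
east_q1 = dice(FACTS, region={"East"}, quarter={"Q1"})
```

Drilling down is modeled here simply as rolling up to a finer set of dimensions; a mining task would then run on whichever subset the analyst has navigated to.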
Why OLAP Mining?
Integration of data mining with data warehouse and OLAP technologies.
Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Interactive characterization, comparison, association, classification, clustering, prediction.
Integration of different data mining functions, e.g., characterized classification, first clustering and then association, etc.
Data Mining: OLAM Architecture
[Figure: OLAM architecture. Components: Database, Data Warehouse, Meta Data, Data Cube, OLAM Engine, OLAP Engine, User GUI API, Data Cube API, ODBC/OLEDB]
Mining Data Dispersion Characteristics
Data Dispersion Characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals:
Data dispersion: analyzed with multiple granularities of precision.
Boxplot or quantile analysis on sorted intervals.
Dispersion analysis on computed measures:
Folding measures into numerical dimensions.
Boxplot or quantile analysis on the transformed cube.
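The dispersion measures listed above (median, quantiles, outliers) are what a boxplot summarizes. A minimal sketch, assuming the common 1.5 × IQR rule for flagging outliers (the slides do not state a rule) and a made-up `sales` list:

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Values farther than k * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


sales = [12, 15, 14, 10, 18, 16, 13, 95]  # hypothetical measure values
summary = five_number_summary(sales)
outliers = iqr_outliers(sales)
```

For a computed cube measure, the same functions would be applied to the measure values of the cells at the chosen level of granularity.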
Visualization of Data Dispersion: Boxplot Analysis
Mining Discriminant Rules
Discrimination: Comparison of two or more classes.
Strategy:
Collect the relevant data respectively into the target class and the contrasting class.
Generalize both classes to the same high-level concepts.
Compare tuples with the same high-level descriptions.
Present for every tuple its description and two numbers:
support: distribution within a single class
comparison: distribution between classes
Highlight the tuples with strong discriminant features.
Relevance Analysis:
Find attributes (features) which best distinguish different classes.
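The two numbers attached to each generalized tuple can be sketched as follows. The exact formulas are an assumption, since the slide gives only the informal definitions: here support is the tuple's share of the target class, and comparison is the target class's share of that tuple across both classes. The student data is made up.

```python
from collections import Counter


def discriminant_table(target, contrast):
    """Map each generalized tuple in the target class to (support, comparison).

    support    = count in target / size of target class        (assumed formula)
    comparison = count in target / count in target + contrast  (assumed formula)
    """
    t_counts, c_counts = Counter(target), Counter(contrast)
    table = {}
    for desc, n in t_counts.items():
        support = n / len(target)
        comparison = n / (n + c_counts[desc])  # Counter gives 0 if absent
        table[desc] = (support, comparison)
    return table


# Hypothetical tuples already generalized to (major, GPA level)
graduates = [("CS", "high GPA")] * 3 + [("CS", "low GPA")]
undergraduates = [("CS", "high GPA")] + [("CS", "low GPA")] * 3

table = discriminant_table(graduates, undergraduates)
```

A tuple with a comparison value close to 1 occurs almost only in the target class, which is what makes it a strong discriminant feature.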
Mining Association Rules
Association rule mining:
Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
Examples.
Rule form: LHS => RHS [support, confidence].
buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, A) [1%, 75%]
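The two bracketed numbers in a rule are its support and confidence. A minimal sketch of how they are computed from transactions, using the standard definitions and a made-up list of baskets:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)


# Hypothetical market baskets
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```

On these baskets the rule buys(x, diapers) => buys(x, beer) holds in 2 of 5 transactions, and in 2 of the 3 transactions that contain diapers.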
Mining Different Kinds of Association Rules
Boolean vs. quantitative associations
Association on discrete vs. continuous data
Single dimension vs. multiple dimensional associations
E.g., association on items bought vs. on multiple predicates.
Single level vs. multiple-level analysis
E.g., what brand of beers is associated with what brand of diapers?
Simple vs. constraint-based
E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Association vs. correlation analysis.
Association does not necessarily imply correlation.
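One common way to check whether a rule reflects a real positive correlation is lift, the ratio of the rule's confidence to the baseline frequency of its right-hand side. The slides do not name a measure, so lift is an assumption here, and the counts are made up for illustration.

```python
def rule_stats(n_total, n_lhs, n_rhs, n_both):
    """Return (support, confidence, lift) for LHS => RHS from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n_total)  # > 1 positive, < 1 negative
    return support, confidence, lift


# Hypothetical counts: 5000 customers, 3000 buy A, 3750 buy B, 2000 buy both
support, confidence, lift = rule_stats(5000, 3000, 3750, 2000)
```

Here the rule A => B has 40% support and about 67% confidence, yet its lift is below 1, because B is bought by 75% of all customers anyway. The rule is a valid association but A and B are negatively correlated.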
Classification
Data categorization based on a set of training objects.
Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.
Example: classify a set of diseases and provide the symptoms which describe each class or subclass.
The classification task: Based on the features present in the class-labeled training data, develop a description or model for each class. It is used for
classification of future test data,
better understanding of each class, and
prediction of certain properties and behaviors.
Data classification methods: Decision trees (e.g., ID3, C4.5), statistics, neural networks, rough sets, etc.
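Decision-tree learners such as ID3 pick the attribute to split on by information gain. The sketch below shows only that selection step, not a full tree builder; the credit-approval rows and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the data on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Attribute with the highest information gain (ID3's split choice)."""
    return max(rows[0], key=lambda a: information_gain(rows, labels, a))


# Hypothetical class-labeled training data for credit approval
rows = [
    {"income": "high", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "low", "owns_home": "no"},
]
labels = ["approve", "approve", "deny", "deny"]

split_on = best_attribute(rows, labels)
```

In this toy data `income` separates the classes perfectly (gain of 1 bit) while `owns_home` tells nothing (gain of 0), so the tree would split on `income` first.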
Major Classification Methods
Decision tree-based classification:
Training set vs. test set or cross-validation
Overfitting problem and tree pruning
Boosting techniques.
Bayesian classification:
Naive Bayesian classification
Bayesian belief networks
Boosting techniques (e.g., AdaBoost).
Neural network approach:
Multi-layer networks and back-propagation.
Genetic algorithms:
Genetic operators and fitness function selection.
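As an illustration of the Bayesian entry above, here is a minimal naive Bayesian classifier for categorical attributes. It is a sketch under stated assumptions: add-one (Laplace) smoothing is my choice, not something the slides specify, and the symptom data is made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attr) -> value counts
    domains = defaultdict(set)           # attr -> values seen in training
    for row, label in zip(rows, labels):
        for attr, value in row.items():
            value_counts[(label, attr)][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for attr, value in row.items():
            count = value_counts[(label, attr)][value]
            score += math.log((count + 1) / (n + len(domains[attr])))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: symptoms -> diagnosis
rows = [
    {"fever": "yes", "cough": "yes"},
    {"fever": "yes", "cough": "no"},
    {"fever": "no", "cough": "yes"},
    {"fever": "no", "cough": "no"},
]
labels = ["flu", "flu", "cold", "cold"]

model = train_naive_bayes(rows, labels)
diagnosis = predict(model, {"fever": "yes", "cough": "yes"})
```

The "naive" part is the assumption that attributes are independent given the class, which is why the per-attribute log probabilities are simply added.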
Three Categories of Clustering Techniques
Partitioning-based:
Basically enumerate various partitions and then score them by some criterion.
K-means, K-medoids, etc.
Hierarchy-based:
Create a hierarchical decomposition of the set of data (or objects) using some criterion.
Model-based:
A model is hypothesized for each of the clusters.
Find the best fit of the data to each hypothesized model.
E.g., Bayesian classification (AutoClass), Cobweb.
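For the partitioning-based category, a minimal k-means sketch follows. It assumes Euclidean distance and, for reproducibility, seeds the centroids with the first k points; real implementations usually use random or smarter initialization. The points are made up.

```python
import math


def k_means(points, k, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [points[i] for i in range(k)]  # naive seeding (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
```

The criterion being scored implicitly is the sum of squared distances from each point to its centroid, which each iteration can only lower or leave unchanged.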
Database Clustering Methods
CLARANS (Ng & Han '94): An extension to the k-medoid algorithm based on randomized search.
BIRCH (Zhang et al. '96): CF tree (a balanced tree structure).
DBSCAN (EKXS '96): connects regions of sufficiently high density into clusters.
STING (WYM '97): A hierarchical cell structure that stores statistical information.
CLIQUE (Agrawal et al. '98): Clusters high-dimensional data.
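The density-based idea behind DBSCAN can be sketched in a few lines. This is a simplified version that scans all points to find neighbors (a database implementation would use a spatial index); the parameter names `eps` and `min_pts` follow common usage, and the points are made up.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Brute-force region query; includes the point itself
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Hypothetical points: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike k-means, the number of clusters is not given in advance, and points in sparse regions are reported as noise rather than forced into a cluster.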
Time-Series Data Mining
Trend and deviation analysis
Find trend (data evolution regularity) and deviations.
Regression analysis, visualization techniques.
Subsequence analysis: similarity search
Subsequence matching: normalization + matching
Template specification: shape and macro specification.
Sequential pattern analysis
Sequential association rules
Periodicity analysis
full periods vs. partial periods, cyclic association
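Trend and deviation analysis by regression can be sketched as fitting a least-squares line to the series and flagging points that fall far from it. The fixed residual threshold and the example series are assumptions for illustration.

```python
def linear_trend(series):
    """Least-squares slope and intercept of the series against time 0..n-1."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviations(series, threshold):
    """(time, value) pairs whose residual from the trend exceeds threshold."""
    slope, intercept = linear_trend(series)
    return [(t, y) for t, y in enumerate(series)
            if abs(y - (intercept + slope * t)) > threshold]


# Hypothetical series: steady growth with one spike at t = 4
series = [10, 12, 14, 16, 30, 20, 22, 24]
slope, intercept = linear_trend(series)
spikes = deviations(series, threshold=5)
```

Note that the spike itself pulls the fitted line upward, so robust variants fit the trend on data with suspected deviations removed.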
Similarity Search in Data Mining
Faloutsos et al. (1994):
Extract features from each window
Fourier Transform & R*-tree structure.
Agrawal et al. (1995):
Amplitude scaling, offset translation
Distance is determined from the sequence envelopes
Agrawal et al. (1995):
SDL pattern language to encode queries about shapes
Jagadish et al. (1997):
domain-independent framework
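The idea of matching subsequences regardless of amplitude scaling and offset translation can be sketched with a brute-force sliding window. This is not any of the cited methods (there is no Fourier feature extraction or R*-tree index here); it only shows normalization followed by matching, with z-normalization as an assumed choice and made-up data.

```python
import math
import statistics


def z_normalize(seq):
    """Remove offset (mean) and amplitude (standard deviation)."""
    mean = statistics.fmean(seq)
    std = statistics.pstdev(seq)
    if std == 0:
        return [0.0] * len(seq)  # flat window: no shape to compare
    return [(v - mean) / std for v in seq]


def best_match(series, query):
    """Return (start index, distance) of the window most similar to query."""
    q = z_normalize(query)
    width = len(query)
    best_start, best_dist = -1, math.inf
    for start in range(len(series) - width + 1):
        window = z_normalize(series[start:start + width])
        dist = math.dist(window, q)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist


# Hypothetical data: the peak in `series` is the query scaled and shifted
query = [1, 2, 3, 2, 1]
series = [5, 5, 5, 5, 10, 30, 50, 30, 10, 5, 5]
start, distance = best_match(series, query)
```

The indexed methods on this slide exist precisely to avoid this linear scan over every window when the series database is large.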
Periodic Pattern Search in Time-Related Data Sets
Full cycle analysis:
Fourier transformation, other statistical analysis methods
Fragment-wise cyclic behavior analysis:
Example: Jack reads the NY Times every day at 9:00am.
Given (natural) periods vs. arbitrary periods.
A data cube and OLAP-based technique (Gong and Han '98)
Cyclic association rules:
Associations which form cycles.
Cyclic Association Rules (B. Özden, S. Ramaswamy, A.
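Fragment-wise (partial) periodicity with a given period can be sketched by counting, for each offset within the period, how often an event recurs across cycles. The confidence threshold, the three-slot day, and the event names are hypothetical, and this is not the data-cube technique cited above.

```python
def periodic_events(sequence, period, min_conf):
    """Find (offset, event) pairs that recur in at least min_conf of cycles.

    `sequence` is a list of event sets, one per time slot.
    """
    cycles = len(sequence) // period
    patterns = {}
    for offset in range(period):
        counts = {}
        for c in range(cycles):
            for event in sequence[c * period + offset]:
                counts[event] = counts.get(event, 0) + 1
        for event, n in counts.items():
            if n / cycles >= min_conf:
                patterns[(offset, event)] = n / cycles
    return patterns


# Hypothetical log: 4 days, 3 slots per day (morning, noon, evening)
days = [
    [{"read_paper"}, {"lunch"}, {"tv"}],
    [{"read_paper"}, set(), {"gym"}],
    [{"read_paper"}, {"lunch"}, set()],
    [{"read_paper"}, {"lunch"}, {"tv"}],
]
sequence = [slot for day in days for slot in day]
patterns = periodic_events(sequence, period=3, min_conf=0.75)
```

Only part of each cycle is regular here (the morning paper, usually lunch), which is the difference between partial and full periodicity.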
Systems for Data Warehousing
Arbor Software: Essbase
Oracle: Express/Data-mart Suite.
Informix: Meta-Cube.
Cognos: PowerPlay
Redbrick Systems: Redbrick Warehouse
Microstrategy: DSS/Server
Microsoft: PLATO (SQL-Server 7.0) [OLEDB for OLAP]
Systems for Data Mining
IBM: Intelligent Miner.
SAS Institute: Enterprise Miner.
Silicon Graphics: MineSet.
Integral Solutions Ltd.: Clementine.
Information Discovery Inc.: Data Mining Suite.
DBMiner Technology Inc.: DBMiner
Rutgers: DataMine, GMD: Explora, Univ. Munich: VisDB
Major Approaches in Data Mining Systems
Database-oriented approach: IBM Intelligent Miner.
OLAM approach: DBMiner.
Machine learning: AQ15, ID3/C4.5/C5.0, Cobweb.
Rough sets, fuzzy sets: Datalogic/R, 49er, etc.
Statistical approaches, e.g., SAS Enterprise Miner.
Neural network approach: Cognos 4Thought.
Conclusions
Data Mining: A rich, promising, young field with broad applications and many challenging research issues.
Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.
Data mining domains: relational, transactional, text, spatial, time-series, multimedia, active DBs, data warehouses, and WWW.
Data mining methods: Data-intensive, statistics, visualization, information science, and other disciplines.
Progress: Scalable methods and multi-task systems.
OLAM: On-line analytical mining holds high promise for the integration of OLAP and mining.
Future Work
Theoretical foundations of data mining.
Implementation and new data mining methodologies:
A set of well-tuned, standard mining operators.
Data and knowledge visualization tools.
Integration of multiple data mining strategies.
Data mining in advanced information systems:
Spatial, multimedia, Web-mining
Data mining applications:
content browsing, query optimization, multi-resolution model, etc.
Social issues: A threat to security and privacy.