sas-sndp-yogam-college-konni
INTRODUCTION
A young, fast growing and promising field
INTRODUCTION
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD)
• Extracting hidden information
• An interdisciplinary subfield of computer science
• The computational process of discovering patterns in large data sets
• Involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems
INTRODUCTION (CONTD.)
• The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
• Aside from the raw analysis step, it involves:
  • database and data management aspects
  • data pre-processing
  • model and inference considerations
  • complexity considerations, post-processing of discovered structures, visualization, and online updating
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  • E.g., global backbone telecommunication networks carry tens of petabytes every day
  • (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)
• Data collection and data availability
  • Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, …
Why Data Mining?
“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
What Motivated Data Mining?
We are drowning in data, but starving for knowledge!
Evolution of Database Technology
Data mining can be viewed as a result of the natural evolution of IT
• 1960s: Data collection, database creation and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
Evolution of Database Technology
• 1990s: Data mining, data warehousing, multimedia databases, and Web databases
• 2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems
What Is Data Mining?
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”? No. These, for example, are not:
  • Simple search and query processing
  • (Deductive) expert systems
Data Mining: Confluence of Multiple Disciplines
Figure: data mining at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
Knowledge Discovery Process
1. Data cleaning – to remove noise and inconsistent data
2. Data integration – to combine multiple sources
3. Data selection – to retrieve relevant data for analysis
4. Data transformation – to transform data into an appropriate form for data mining
5. Data mining – an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation – to identify truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation – visualization and representation techniques
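The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the record layout, the two source lists, and every function name are invented for the example, and the "mining" step is a plain frequency count standing in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw data from two sources; None and "" stand for noisy values.
store_a = [{"cust": "c1", "item": "milk"}, {"cust": "c2", "item": None}]
store_b = [{"cust": "c3", "item": "Milk"}, {"cust": "c4", "item": "bread"},
           {"cust": "c5", "item": ""}]

def clean(records):
    """1. Data cleaning: drop records with a missing item."""
    return [r for r in records if r["item"]]

def integrate(*sources):
    """2. Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """3. Data selection: keep only the attribute relevant to the analysis."""
    return [r["item"] for r in records]

def transform(items):
    """4. Data transformation: normalize values to one consistent form."""
    return [item.strip().lower() for item in items]

def mine(items):
    """5. Data mining: here simply count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count=2):
    """6. Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """7. Knowledge presentation: render the patterns as readable text."""
    return [f"{item}: seen {n} times" for item, n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data))))
report = present(patterns)
print(report)
```

Keeping each step as its own function mirrors the numbered list, so a single step (for example, the mining method) can be swapped without touching the others.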
Example: A Web Mining Framework
Web mining usually involves
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored into a knowledge base
Data Mining in Business Intelligence
Figure: a layered pyramid. The potential to support business decisions increases toward the top. Layers, from top to bottom:
• Decision Making
• Data Presentation: visualization techniques
• Data Mining: information discovery
• Data Exploration: statistical summary, querying, and reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: paper, files, Web documents, scientific experiments, database systems
Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structure data, graphs, social networks and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data
  • Multimedia database
  • Text databases
  • The World-Wide Web
RDBMS
A relational database is a collection of tables of data items, all formally described and organized according to the relational model.
Data in a single table represents a relation.
Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.
A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.
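The role of primary and foreign keys can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not a prescribed schema: the `customer` and `purchase` tables, their columns, and the sample rows are assumptions made for illustration. Note that SQLite checks foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: each purchase row points at one customer row.
conn.execute("""
    CREATE TABLE customer (
        cust_id   INTEGER PRIMARY KEY,
        cust_name TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE purchase (
        purchase_id INTEGER PRIMARY KEY,
        cust_id     INTEGER NOT NULL REFERENCES customer(cust_id),
        amount      REAL
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO purchase VALUES (100, 1, 49.5)")  # valid: customer 1 exists

def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# A second row with the same primary key is rejected.
duplicate_key_ok = try_insert("INSERT INTO customer VALUES (1, 'Ravi')")
# A purchase that references a non-existent customer is rejected.
dangling_ref_ok = try_insert("INSERT INTO purchase VALUES (101, 99, 10.0)")
print(duplicate_key_ok, dangling_ref_ok)
```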
RDBMS
• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.
• The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.
• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.
• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.
• The relational database has become the predominant choice for storing data.
Relational database terminology.
A relation is defined as a set of tuples that have the same attributes
RDBMS (contd.)
Example: Allelectronics (a company described by the relation tables customer, item, employee, and branch)
• Relation: customer is a group of entities describing the customer information (Cust_id, cust_name, Age, Occupation, annual income, credit information, and category)
• Tables: also used to represent the relationships between or among multiple entities
• Database queries (SQL): for accessing data using relational operations such as join, selection, and projection
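The three relational operations named above (selection, projection, join) can be illustrated with `sqlite3`. This is a sketch under assumptions: the `purchase` table, the reduced set of `customer` columns, and all sample rows are hypothetical and chosen only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, age INTEGER);
    CREATE TABLE purchase (cust_id INTEGER, item TEXT, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha', 34), (2, 'Ravi', 22), (3, 'Mina', 45);
    INSERT INTO purchase VALUES (1, 'laptop', 900.0), (1, 'mouse', 20.0), (3, 'TV', 600.0);
""")

# Selection: keep only the rows that satisfy a condition.
selection = conn.execute(
    "SELECT * FROM customer WHERE age > 30 ORDER BY cust_id").fetchall()

# Projection: keep only some of the columns.
projection = conn.execute(
    "SELECT cust_name FROM customer ORDER BY cust_id").fetchall()

# Join: combine rows of two tables that share a key value.
join = conn.execute("""
    SELECT c.cust_name, p.item
    FROM customer AS c JOIN purchase AS p ON c.cust_id = p.cust_id
    ORDER BY c.cust_id, p.item""").fetchall()

print(selection, projection, join, sep="\n")
```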
Mining Relational Databases
• Can go further than querying by searching for trends or data patterns
• Examples
  • Analyze customer data to predict the credit risk of customers based on their income and age
  • Detect deviations: compare sales with the previous year
• Relational databases are one of the most commonly available and richest information repositories for data mining
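As one possible reading of "detect deviations", the sketch below compares this year's sales with the previous year's and flags large relative changes. The branch names, the figures, and the 20% threshold are all assumptions chosen for illustration.

```python
# Hypothetical yearly sales per branch (units are arbitrary).
sales = {
    "branch_a": {"previous": 1000.0, "current": 1040.0},
    "branch_b": {"previous": 800.0, "current": 560.0},
    "branch_c": {"previous": 500.0, "current": 650.0},
}

def relative_change(previous, current):
    """Change from the previous year as a fraction of the previous value."""
    return (current - previous) / previous

def find_deviations(sales, threshold=0.2):
    """Return branches whose sales changed by more than `threshold` (20% by default)."""
    flagged = {}
    for branch, years in sales.items():
        change = relative_change(years["previous"], years["current"])
        if abs(change) > threshold:
            flagged[branch] = round(change, 2)
    return flagged

deviations = find_deviations(sales)
print(deviations)
```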
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  • A decision support database that is maintained separately from the organization’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing: the process of constructing and using data warehouses
DATA WAREHOUSES
• A repository of information collected from multiple sources, stored under a unified schema.
• Constructed via
  • Data cleaning
  • Data integration
  • Data transformation
  • Data loading and periodic data refreshing
DATA WAREHOUSES (contd.)
• A data warehouse is modeled by a multidimensional data structure
• Data cube: precomputation and fast access of summarized data
  • Each dimension corresponds to an attribute or a set of attributes in a schema
  • Each cell stores the value of some aggregate measure (count, sum, etc.)
• Example: in Allelectronics the cube has three dimensions:
  • Address (with country values U.S.A., Canada, Mexico)
  • Time (with quarter values Q1, Q2, Q3, Q4)
  • Item (with item type values)
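The idea of a data cube (one precomputed aggregate per combination of dimension values) can be sketched in a few lines. The fact rows and the `ALL` marker below are assumptions for illustration; a real warehouse would use an OLAP engine rather than a Python dictionary.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (country, quarter, item type, sales amount).
facts = [
    ("U.S.A", "Q1", "TV", 100), ("U.S.A", "Q2", "TV", 150),
    ("U.S.A", "Q1", "PC", 200), ("Canada", "Q1", "TV", 50),
    ("Mexico", "Q3", "VCR", 30), ("Canada", "Q4", "PC", 70),
]

ALL = "*"  # marker meaning "summed over every value of this dimension"

def build_cube(rows):
    """Precompute the sum of sales for every combination of dimension values,
    including combinations where one or more dimensions are summed out (ALL)."""
    cube = defaultdict(int)
    for country, quarter, item, amount in rows:
        for key in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[key] += amount
    return cube

cube = build_cube(facts)
print(cube[("U.S.A", ALL, "TV")])  # total annual TV sales in the U.S.A
print(cube[(ALL, "Q1", ALL)])      # all sales in Q1
print(cube[(ALL, ALL, ALL)])       # grand total
```

Each fact row updates eight cells (two choices per dimension), which is what makes later lookups of any summary a single dictionary access.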
Multidimensional Data
Sales volume as a function of product, month, and region
Figure: a cube with axes Product, Region, and Month
Dimensions: Product, Location, Time
Hierarchical summarization paths:
• Product: Industry > Category > Product
• Location: Region > Country > City > Office
• Time: Year > Quarter > Month or Week > Day
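Moving up a summarization path (for example City to Country, or Month to Quarter) is often called a roll-up. The sketch below shows one such step; the city-to-country mapping, the sales figures, and the `YYYY-MM` month format are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping for one step of the location hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "U.S.A", "New York": "U.S.A",
    "Toronto": "Canada", "Vancouver": "Canada",
}

# Hypothetical sales recorded at the (city, month) level.
city_month_sales = {
    ("Chicago", "2024-01"): 120, ("Chicago", "2024-02"): 80,
    ("New York", "2024-01"): 200, ("Toronto", "2024-04"): 90,
    ("Vancouver", "2024-05"): 60,
}

def month_to_quarter(month):
    """Map 'YYYY-MM' to 'YYYY-Qn' (one step up the time hierarchy)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(sales):
    """Summarize (city, month) sales to the coarser (country, quarter) level."""
    summary = defaultdict(int)
    for (city, month), amount in sales.items():
        summary[(CITY_TO_COUNTRY[city], month_to_quarter(month))] += amount
    return dict(summary)

country_quarter_sales = roll_up(city_month_sales)
print(country_quarter_sales)
```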
A Sample Data Cube
Figure: a data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell holds the total annual sales of TVs in the U.S.A.
Data mining functionalities
• Tasks can be classified as:
  • Predictive (makes predictions about values of data using known results found from different data)
  • Descriptive (characterizes properties of a target data set; explores the properties of the data examined)
• Data mining functionalities are used to specify the kinds of patterns to be found:
  • Characterization and discrimination
  • The mining of frequent patterns, associations, and correlations
  • Classification and regression
  • Cluster analysis
  • Outlier analysis
Characterization and Discrimination
• Data characterization is a summarization of the general characteristics or features of a target class of data
• Output of characterization can be presented in various forms
  • Pie charts
  • Bar charts
  • Curves
  • Multidimensional data cubes
  • Multidimensional tables
  • Descriptions presented as generalized relations or as characteristic rules
• Example: in Allelectronics, summarize the characteristics of customers who spend more than $5000 a year at Allelectronics
  • The result can be viewed along any dimension, such as occupation, to see these customers according to their type of employment.
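A characterization like the one in the example can be sketched as a filter followed by a summary along one dimension. The customer records below are invented for illustration, and the share-per-occupation summary is just one of many possible output forms.

```python
from collections import Counter

# Hypothetical customer records: (name, occupation, amount spent this year in dollars).
customers = [
    ("Asha", "engineer", 7200), ("Ravi", "teacher", 3100),
    ("Mina", "engineer", 5600), ("Jose", "doctor", 9800),
    ("Lena", "teacher", 5100), ("Omar", "engineer", 1500),
]

def characterize(records, min_spend=5000):
    """Summarize the target class (customers spending more than `min_spend`)
    along the occupation dimension, as a share of the target class."""
    target = [r for r in records if r[2] > min_spend]
    counts = Counter(occupation for _, occupation, _ in target)
    return {occ: n / len(target) for occ, n in counts.items()}

profile = characterize(customers)
print(profile)
```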
Data Discrimination
• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes
• Output representation is similar to that of characterization descriptions
• Discrimination descriptions expressed in the form of rules are called discrimination rules
• Target and contrasting classes are specified by the user
• Example: a user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period
Mining Frequent Patterns, Associations, Correlations
• Frequent patterns
  • Frequent itemsets (milk, bread)
  • Frequent subsequences (laptop, digital camera, memory card)
  • Frequent substructures (graphs, trees)
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
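Frequent itemsets can be found, for small data, by simply counting every candidate. The sketch below does this by brute force for itemsets of a fixed size; it is not an efficient algorithm such as Apriori, and the transactions and the 40% minimum support are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

def frequent_itemsets(baskets, size, min_support):
    """Return itemsets of the given size contained in at least `min_support`
    (a fraction) of the baskets, by brute-force counting."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {itemset: n / len(baskets)
            for itemset, n in counts.items() if n / len(baskets) >= min_support}

frequent_pairs = frequent_itemsets(transactions, size=2, min_support=0.4)
print(frequent_pairs)
```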
Association analysis (example)
• Items frequently purchased together:
  buys(X, “computer”) => buys(X, “software”) [support=1%, confidence=50%]
• X: a variable representing a customer
• Confidence (certainty) of 50%: if a customer buys a computer, there is a 50% chance that the customer will buy software as well
• Support of 1%: 1% of all the transactions under analysis show computer and software purchased together
• Single-dimensional association rule (a single predicate, buys):
  “computer => software [1%, 50%]”
• Multidimensional association rule:
  age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”) [support=2%, confidence=60%]
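Support and confidence for a rule such as computer => software can be computed directly from their definitions. The transactions below are invented, so the resulting numbers differ from the 1% and 50% quoted in the slide; the function names are likewise assumptions.

```python
# Hypothetical transactions: the set of items each customer bought.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
    {"software"},
    {"printer", "paper"},
]

def support(baskets, items):
    """Fraction of all transactions that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    return support(baskets, antecedent | consequent) / support(baskets, antecedent)

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"computer => software [support={rule_support:.0%}, confidence={rule_confidence:.0%}]")
```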
Classification and Regression for Predictive Analysis
• Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts
• The model is derived from the analysis of a set of training data
• Models are represented as
  • Classification rules (IF-THEN rules)
  • Decision trees
  • Mathematical formulae
  • Neural networks
• Regression: a statistical methodology for numeric prediction
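The two model types can be contrasted with a small sketch: a classifier written as IF-THEN rules, which outputs a class label, and a least-squares line, which outputs a number. The rules, attribute names, and training data are invented for illustration; in practice the rules would be learned from training data rather than written by hand.

```python
def classify(age, income):
    """A hand-written IF-THEN rule model (the rules are invented for illustration)."""
    if age < 30 and income == "high":
        return "likely_buyer"
    if age >= 30 and income == "low":
        return "unlikely_buyer"
    return "undecided"

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for numeric prediction."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training data: years as a customer vs. yearly spend.
years = [1, 2, 3, 4]
spend = [110, 210, 310, 410]
slope, intercept = fit_line(years, spend)
predicted = slope * 5 + intercept  # numeric prediction for year 5

print(classify(25, "high"), predicted)
```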
Cluster Analysis and Outlier Analysis
• Cluster analysis
  • Determines similarity among data on predefined attributes
  • The most similar data are grouped into clusters
• Outlier analysis
  • Outliers: a data set may contain objects that do not comply with the general behavior or model of the data
  • Analysis of outlier data is referred to as outlier analysis or anomaly mining
  • Outliers can be detected using statistical tests
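Both ideas can be sketched on one-dimensional data. The clustering below is a simple gap-based grouping (not k-means or any other standard algorithm), and the outlier test is a z-score check; the data, the gap of 5, and the threshold of 2 standard deviations are assumptions for the example.

```python
from statistics import mean, stdev

def cluster_1d(values, max_gap):
    """Group sorted numeric values: a value joins the current cluster when it
    lies within `max_gap` of the previous value, otherwise it starts a new one."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def z_score_outliers(values, threshold=2.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

ages = [21, 22, 24, 40, 41, 43, 44]
incomes = [30, 32, 31, 29, 33, 30, 31, 95]  # hypothetical, in thousands

clusters = cluster_1d(ages, max_gap=5)
outliers = z_score_outliers(incomes)
print(clusters, outliers)
```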
Which Technologies Are Used?
Figure: data mining draws on machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology.
Potential Applications of Data Mining
Where there are data, there are data mining applications
• Data analysis and decision support (Business Intelligence)
  • Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
• Other applications
  • Text mining (news group, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining