Introduction to Data Mining

INTRODUCTION

A young, fast growing and promising field

INTRODUCTION

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD)

• Extracting hidden information
• An interdisciplinary subfield of computer science
• The computational process of discovering patterns in large data sets
• Involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems

INTRODUCTION (CONTD.)

• The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

• Aside from the raw analysis step, it involves:
  • database and data management aspects
  • data pre-processing
  • model and inference considerations
  • complexity considerations
  • post-processing of discovered structures
  • visualization
  • online updating

Why Data Mining?

The explosive growth of data: from terabytes to petabytes
• E.g., global backbone telecommunication networks carry tens of petabytes every day
• (1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte)

Data collection and data availability
• Automated data collection tools, database systems, the Web, computerized society

Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, …

Why Data Mining?

“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!

Evolution of Database Technology

Data mining can be viewed as a result of the natural evolution of IT

• 1960s: data collection, database creation, and network DBMS
• 1970s: relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.)

Evolution of Database Technology

• 1990s: data mining, data warehousing, multimedia databases, and Web databases
• 2000s: stream data management and mining, data mining and its applications, Web technology (XML, data integration), and global information systems

What Is Data Mining?

Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: is everything “data mining”? The following are not:
• Simple search and query processing
• (Deductive) expert systems

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center of the contributing disciplines: database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines]

Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process

[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Knowledge Discovery Process

1. Data cleaning: remove noise and inconsistent data
2. Data integration: combine multiple data sources
3. Data selection: retrieve the data relevant to the analysis
4. Data transformation: transform data into forms appropriate for data mining
5. Data mining: an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: present the mined knowledge using visualization and representation techniques
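The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two data sources, the field names (`cust`, `items`), and the choice of "count co-occurring item pairs" as the mining step are all hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources; None marks a missing (noisy) value.
store_a = [{"cust": 1, "items": ["milk", "bread"]},
           {"cust": 2, "items": None}]
store_b = [{"cust": 3, "items": ["milk", "bread", "eggs"]},
           {"cust": 4, "items": ["bread", "eggs"]}]

def clean(records):
    # 1. Data cleaning: drop records whose item list is missing or empty.
    return [r for r in records if r["items"]]

def integrate(*sources):
    # 2. Data integration: combine multiple sources into one list.
    return [r for src in sources for r in src]

def select(records):
    # 3. Data selection: keep only the attribute relevant to the analysis.
    return [r["items"] for r in records]

def transform(baskets):
    # 4. Data transformation: each basket becomes a sorted tuple of unique items.
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    # 5. Data mining: count how often each pair of items occurs together.
    counts = Counter()
    for b in baskets:
        counts.update(combinations(b, 2))
    return counts

def evaluate(counts, min_count=2):
    # 6. Pattern evaluation: keep pairs that meet a simple interestingness threshold.
    return {pair: n for pair, n in counts.items() if n >= min_count}

def present(patterns):
    # 7. Knowledge presentation: render the patterns as readable lines.
    return [f"{a} + {b}: {n} baskets" for (a, b), n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data))))
report = present(patterns)
```

Here each source is cleaned before the sources are combined; in practice the order and content of the steps depend on the data at hand.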

Example: A Web Mining Framework

Web mining usually involves
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases toward the top]

Layers, from top to bottom:
• Decision Making
• Data Presentation: visualization techniques
• Data Mining: information discovery
• Data Exploration: statistical summary, querying, and reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: paper, files, Web documents, scientific experiments, database systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

KDD Process: A Typical View from ML and Statistics

This is a view from typical machine learning and statistics communities

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing
• Data integration
• Normalization
• Feature selection
• Dimension reduction

Data Mining
• Pattern discovery
• Association & correlation
• Classification
• Clustering
• Outlier analysis
• …

Post-Processing
• Pattern evaluation
• Pattern selection
• Pattern interpretation
• Pattern visualization
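As one concrete instance of the normalization step in pre-processing, the sketch below applies min-max scaling, which is one common choice among several. The function name and the income values are hypothetical, chosen only for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical annual incomes in dollars.
incomes = [20000, 35000, 50000, 80000]
normalized = min_max_normalize(incomes)
```

After scaling, attributes measured on very different ranges (for example age and income) contribute on a comparable scale to distance-based methods.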

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications
• Relational database, data warehouse, transactional database

Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structured data, graphs, social networks, and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data
• Multimedia databases
• Text databases
• The World-Wide Web

RDBMS

A database that has a collection of tables of data items, all of which is formally described and organized according to the relational model.

Data in a single table represents a relation.

Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.

A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.
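A minimal sketch of primary and foreign keys, using Python's built-in `sqlite3` module with an in-memory database. The `branch` and `employee` tables and their columns are hypothetical examples, not a schema from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE branch (
        branch_id INTEGER PRIMARY KEY,   -- uniquely identifies each row
        city      TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        -- foreign key: points to the primary key of branch
        branch_id INTEGER NOT NULL REFERENCES branch(branch_id)
    )""")

conn.execute("INSERT INTO branch VALUES (1, 'Chicago')")
conn.execute("INSERT INTO employee VALUES (10, 'Ana', 1)")

# A row that points at a branch that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Raj', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The second insert is refused because no row in `branch` has the key 99, which is how the database keeps the relationship between the two tables consistent.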

RDBMS

• Database normalization: the relational model offers various levels of refinement of table organization and reorganization.

• The DBMS of a relational database is called an RDBMS; it is the software that manages a relational database.

• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.

• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.

• The relational database has become the predominant choice for storing data.

Relational database terminology.

A relation is defined as a set of tuples that have the same attributes

RDBMS (contd.)

Example: Allelectronics (a company described by the relation tables customer, item, employee, and branch)

• Relation: customer is a group of entities describing customer information (cust_id, cust_name, age, occupation, annual income, credit information, and category)
• Tables: also used to represent the relationships between or among multiple entities
• Database queries (SQL): access data using relational operations such as join, selection, and projection
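The three relational operations named above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `customer` columns are a subset of those listed, while the `purchases` table and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY, cust_name TEXT, age INTEGER, income INTEGER);
    CREATE TABLE purchases (
        cust_id INTEGER REFERENCES customer(cust_id), item TEXT, amount REAL);
    INSERT INTO customer VALUES
        (1, 'Lee', 34, 52000), (2, 'Kim', 22, 31000), (3, 'Roy', 45, 78000);
    INSERT INTO purchases VALUES
        (1, 'laptop', 900.0), (1, 'mouse', 20.0), (3, 'TV', 650.0);
""")

# Selection: keep only the rows that satisfy a condition.
selection = conn.execute(
    "SELECT * FROM customer WHERE income > 50000 ORDER BY cust_id").fetchall()

# Projection: keep only some of the columns.
projection = conn.execute(
    "SELECT cust_name, age FROM customer ORDER BY cust_id").fetchall()

# Join: combine rows from two tables that share a key value.
join = conn.execute("""
    SELECT c.cust_name, p.item
    FROM customer AS c JOIN purchases AS p ON c.cust_id = p.cust_id
    ORDER BY c.cust_id, p.item""").fetchall()
```

Customer 2 has no purchases, so the (inner) join returns no row for that customer.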

Mining Relational databases

Can go further by searching for trends or data patterns

Examples
• Analyze customer data to predict the risk of customers based on their income and age
• Detect deviations: compare sales with the previous year

Relational databases are one of the most commonly available and richest information repositories for data mining

What is a Data Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization’s operational database

Supports information processing by providing a solid platform of consolidated, historical data for analysis

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)

Data warehousing: the process of constructing and using data warehouses

DATA WAREHOUSES

A repository of information collected from multiple sources, stored under a unified schema

Constructed via
• Data cleaning
• Data integration
• Data transformation
• Data loading and periodic data refreshing

DATA WAREHOUSES (contd.)

A data warehouse is modeled by a multidimensional data structure

Data cube: precomputation and fast access of summarized data
• Each dimension corresponds to an attribute or a set of attributes in a schema
• Each cell stores the value of some aggregate measure (count, sum, etc.)

Example: in Allelectronics the cube has three dimensions:
• Address (with country values USA, Canada, Mexico)
• Time (with quarter values Q1, Q2, Q3, Q4)
• Item (with item type values)
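To make the idea of precomputed aggregate cells concrete, the sketch below builds a tiny three-dimensional cube in plain Python. The sales figures are made up, and the `ALL` marker and `build_cube` function are hypothetical names used only for this illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sales facts: (country, quarter, item_type, amount).
sales = [
    ("USA", "Q1", "TV", 100), ("USA", "Q2", "TV", 150),
    ("Canada", "Q1", "TV", 80), ("USA", "Q1", "PC", 200),
    ("Mexico", "Q2", "PC", 60),
]

ALL = "*"  # marker meaning "summed over this dimension"

def build_cube(facts):
    """Precompute the sum of amount for every combination of dimension values,
    including combinations where one or more dimensions are summed out."""
    cube = defaultdict(int)
    for country, quarter, item, amount in facts:
        # Each fact contributes to 2**3 = 8 cells: every dimension is either
        # kept at its own value or replaced by ALL.
        for cell in product((country, ALL), (quarter, ALL), (item, ALL)):
            cube[cell] += amount
    return dict(cube)

cube = build_cube(sales)

# Fast access: each summary is a single dictionary lookup.
tv_sales_usa = cube[("USA", ALL, "TV")]   # all quarters, TVs, USA
q1_sales = cube[(ALL, "Q1", ALL)]         # all countries and items, Q1
grand_total = cube[(ALL, ALL, ALL)]
```

Real warehouses usually materialize only part of the cube, because the number of cells grows quickly with the number of dimensions and distinct values.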

Multidimensional Data

Sales volume as a function of product, month, and region

[Figure: cube with axes Product, Region, and Month]

Dimensions: Product, Location, Time

Hierarchical summarization paths
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month or Week → Day
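Summarizing along a hierarchy (often called roll-up) can be sketched as follows for one step of the Location path, from City to Country. The city names, sales figures, and the `roll_up` helper are hypothetical.

```python
from collections import Counter

# Hypothetical city-level sales and a city -> country mapping
# (one level of the Location hierarchy).
city_sales = {"Chicago": 120, "New York": 200, "Toronto": 90, "Vancouver": 70}
city_to_country = {
    "Chicago": "USA", "New York": "USA",
    "Toronto": "Canada", "Vancouver": "Canada",
}

def roll_up(measures, parent_of):
    """Sum each child's measure into its parent in the hierarchy."""
    totals = Counter()
    for child, value in measures.items():
        totals[parent_of[child]] += value
    return dict(totals)

country_sales = roll_up(city_sales, city_to_country)
```

Applying the same function again with a country-to-region mapping would move one more level up the path.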

A Sample Data Cube

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in the U.S.A.]

Data mining functionalities

Tasks can be classified as
• Predictive: makes predictions about values of data using known results found from different data
• Descriptive: characterizes properties of a target data set and explores the properties of the data examined

Data mining functionalities are used to specify the kinds of patterns to be found
• Characterization and discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and regression
• Cluster analysis
• Outlier analysis

Characterization and Discrimination

Data characterization is a summarization of the general characteristics or features of a target class of data

Output of characterization can be presented in various forms
• Pie charts
• Bar charts
• Curves
• Multidimensional data cubes
• Multidimensional tables
• Descriptions presented as generalized relations or as characteristic rules

Example: in Allelectronics, summarize the characteristics of customers who spend more than $5000 a year. The result can be viewed along any dimension, such as occupation, to view these customers according to their type of employment.
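The Allelectronics example can be sketched in a few lines: pick the target class (customers spending more than $5000 a year) and summarize it along the occupation dimension. The customer records below are invented for illustration.

```python
from collections import Counter

# Hypothetical customer records: (name, occupation, yearly_spend_in_dollars).
customers = [
    ("Ana", "engineer", 7200),
    ("Ben", "teacher", 3100),
    ("Cy", "engineer", 5600),
    ("Di", "doctor", 9800),
    ("Ed", "teacher", 5100),
]

# Target class: customers who spend more than $5000 a year.
target = [c for c in customers if c[2] > 5000]

# Characterize the target class along the occupation dimension.
by_occupation = Counter(occupation for _, occupation, _ in target)
share = {occ: n / len(target) for occ, n in by_occupation.items()}
```

The resulting shares are the kind of numbers a pie chart or bar chart of the characterization would display.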

Data Discrimination

Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes

Output representation is similar to that of characterization descriptions

Discrimination descriptions expressed in the form of rules are called discrimination rules

Target and contrasting classes are specified by the user

Example: a user wants to compare the general features of software products whose sales increased by 10% against those whose sales decreased by 30% during the same period

Mining Frequent Patterns, Associations, Correlations

Frequent patterns
• Frequent itemsets (milk, bread)
• Frequent subsequences (laptop, digital camera, memory card)
• Frequent substructures (graphs, trees)

Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

Association analysis (example)

Items frequently purchased together:

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

• X is a variable representing a customer
• A confidence (certainty) of 50% means that a customer who buys a computer has a 50% chance of buying software as well
• A support of 1% means that 1% of all transactions under analysis show computer and software purchased together
• This is a single-dimensional association rule, which can be written “computer => software [1%, 50%]”

Multidimensional association rule:

age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”) [support = 2%, confidence = 60%]
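Support and confidence for a rule such as computer => software can be computed directly from a list of transactions. The sketch below uses five invented transactions, so its numbers differ from the 1% and 50% quoted above; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical transactions, one set of purchased items per customer visit.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

In this toy data two of five transactions contain both items, and two of the three computer purchases also include software.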

Classification and Regression for Predictive Analysis

Classification: the process of finding a model (function) that describes and distinguishes data classes or concepts

The model is derived from the analysis of a set of training data

Models are represented as
• Classification rules (IF-THEN rules)
• Decision trees
• Mathematical formulae
• Neural networks
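As a small illustration of deriving a model from training data, the sketch below learns a one-level decision tree (a decision stump), which can be read as a single IF-THEN rule. This is a deliberately simple stand-in for full decision tree learning; the income values, risk labels, and function names are hypothetical.

```python
def fit_stump(xs, labels):
    """Learn one rule on a numeric attribute:
    IF x <= threshold THEN left_label ELSE right_label.
    Assumes xs contains at least two distinct values."""
    best = None
    values = sorted(set(xs))
    classes = sorted(set(labels))
    # Candidate thresholds are the midpoints between neighbouring values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for left in classes:
            for right in classes:
                # Count training examples the candidate rule gets wrong.
                errors = sum(
                    (left if x <= threshold else right) != y
                    for x, y in zip(xs, labels))
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return threshold, left, right

def predict(rule, x):
    threshold, left, right = rule
    return left if x <= threshold else right

# Hypothetical training data: annual income in $1000s and a risk label.
incomes = [18, 22, 30, 41, 55, 70]
risk = ["high", "high", "high", "low", "low", "low"]

rule = fit_stump(incomes, risk)
```

The learned rule reads "IF income <= 35.5 THEN high ELSE low"; calling `predict(rule, 25)` applies it to a new customer.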

Regression: Statistical methodology for numeric prediction
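A minimal sketch of numeric prediction, assuming the simplest case of a straight line fitted by ordinary least squares to one input variable. The data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # numeric prediction for an unseen x = 5
```

With noisy data the fitted line no longer passes through every point, but the same formulas give the line with the smallest sum of squared errors.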

Cluster Analysis and Outlier Analysis

Cluster analysis
• Determines similarity among data on predefined attributes
• The most similar data are grouped into clusters
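One common way to group similar data is k-means, which the slides do not name; it is used here only as an example technique. The sketch works on a single numeric attribute, with hypothetical ages and hand-picked starting centroids so that the run is deterministic.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Basic k-means on one numeric attribute with fixed initial centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

# Hypothetical customer ages forming two visible groups.
ages = [21, 23, 25, 60, 62, 64]
centroids, clusters = kmeans_1d(ages, centroids=[20.0, 30.0])
```

Similarity here is simply the absolute difference in age; with several attributes a distance such as Euclidean distance would play the same role.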

Outlier analysis
• Outliers: a data set may contain objects that do not comply with the general behavior or model of the data
• Analysis of outlier data is referred to as outlier analysis or anomaly mining
• Detected using statistical tests
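A simple statistical test for outliers is the z-score: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0, the purchase amounts, and the function name are assumptions made for this sketch.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical purchase amounts with one unusually large value.
purchases = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = z_score_outliers(purchases)
```

The z-score is sensitive to the outliers themselves, because they inflate both the mean and the standard deviation, so in small samples a lower threshold or a median-based test may be preferable.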

Which Technologies Are Used?

[Figure: data mining at the center of the technologies it draws on: machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology]

Potential Applications of Data Mining

Where there are data, there are data mining applications

Data analysis and decision support (business intelligence)
• Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
• Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and detection of unusual patterns (outliers)

Other applications
• Text mining (newsgroups, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis

Major Issues in Data Mining (1)

Mining Methodology

Mining various and new kinds of knowledge

Mining knowledge in multi-dimensional space

Data mining: An interdisciplinary effort

Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining

User Interaction

Interactive mining

Incorporation of background knowledge

Presentation and visualization of data mining results

Major Issues in Data Mining (2)

Efficiency and Scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed, stream, and incremental mining methods

Diversity of data types

Handling complex types of data

Mining dynamic, networked, and global data repositories

Data mining and society

Social impacts of data mining

Privacy-preserving data mining

Invisible data mining

Education
