© P. Giorgini, F. Dalpiaz 1

Data Mining – Day 1

Fabiano Dalpiaz

Department of Information and

Communication Technology

University of Trento - Italy

http://www.dit.unitn.it/~dalpiaz

Database e Business Intelligence

A.A. 2007-2008


Acknowledgements

This presentation is partially based on the slides for the book:

Data Mining: Concepts and Techniques, 2nd ed., Jiawei Han and Micheline Kamber


Two-day outline

• Data Mining and KDD
• Why Data Mining
• Applications of Data Mining
• Data Preprocessing
• Data Mining techniques
• Visualization of the results
• Summary


Data Mining and KDD



Looking for knowledge

The Explosive Growth of Data

The World Wide Web

Business: e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation

Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co.

We are drowning in data, but starving for knowledge!

Avoid data tombs

“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.


What is Data Mining?

Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Are simple search engines data mining? Are queries data mining? Are expert systems data mining?


Knowledge Discovery (KDD) Process

[Figure: the KDD process] Data sources → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation


Data Mining and Business Intelligence

[Figure: pyramid of layers, listed here from top to bottom; the potential to support business decisions increases toward the top]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Second axis label: Quantity of data


Data Mining: confluence of multiple disciplines

[Figure: Data Mining at the center, surrounded by the disciplines it draws on]
• Database Technology
• Statistics
• Machine Learning
• Pattern Recognition
• Algorithms
• Visualization
• Other Disciplines


Why Data Mining?


Why is Data Mining so complex? A matter of data dimensions

Tremendous amount of data
• Walmart: customer buying patterns, a data warehouse of 7.5 terabytes in 1995
• VISA: detecting credit card interoperability issues, 6800 payment transactions per second

High dimensionality of data
• Many dimensions to be combined together
• Data cube example: time, location, product sales

High complexity of data
• Time-series data, temporal data, sequence data
• Structured data, graphs, social networks, and multi-linked data
• Spatial, spatiotemporal, multimedia, text, and Web data


What does Data Mining provide me with? (1)

Multidimensional concept description: characterization and discrimination
• Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Characterization describes things in the same class; discrimination describes how to separate different classes

Frequent patterns, association, correlation vs. causality
• Wine → Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought]
• Is this correlation or not?
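The two percentages attached to an association rule are usually called support and confidence. A minimal Python sketch of how they can be computed follows; the baskets and the helper names `support` and `confidence` are hypothetical and do not come from the slides.

```python
def support(baskets, itemset):
    """Fraction of all baskets that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    both = support(baskets, set(antecedent) | set(consequent))
    ante = support(baskets, antecedent)
    return both / ante if ante else 0.0


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    frozenset({"wine", "spaghetti", "tomato sauce"}),
    frozenset({"wine", "spaghetti"}),
    frozenset({"wine", "bread"}),
    frozenset({"spaghetti", "tomato sauce"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"wine", "spaghetti"})           # 2 of 5 baskets
rule_confidence = confidence(baskets, {"wine"}, {"spaghetti"})   # 2 of 3 wine baskets
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A high confidence alone does not show that buying wine causes buying spaghetti; it only says the two items tend to appear together in these baskets.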



Classification and prediction
• Construct models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
• Predict some unknown or missing numerical values

Cluster analysis
• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
• Maximize intra-class similarity and minimize inter-class similarity



Outlier analysis
• Outlier: a data object that does not comply with the general behavior of the data
• Fraud detection is the main application area
• Noise or exception?

Trend and evolution analysis
• Trend and deviation: e.g., regression analysis
• Sequential pattern mining: e.g., digital camera → large SD memory
• Periodicity analysis
• Similarity-based analysis
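The slides do not name a specific outlier detection method. As one simple possibility, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. The function name `zscore_outliers`, the threshold of 2, and the transaction amounts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card payments in euros; one of them is far from the rest.
amounts = [20, 22, 21, 19, 23, 20, 21, 22, 500]
suspicious = zscore_outliers(amounts)
print(suspicious)
```

Whether a flagged value is noise to be removed or an exception worth investigating (for example, a fraud case) is a decision the method cannot make by itself.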


Applications of Data Mining: Market Analysis and Management

Data sources
• Credit card transactions, loyalty cards, smart cards, discount coupons, ...

Target marketing
• Find clusters of “model” customers who share the same characteristics:
  • Geographics (lives in Rome, lives in Trentino)
  • Demographics (married, aged 21-35, at least one child, family income above €40,000/year)
  • Psychographics (likes new products, consistently uses the Web)
  • Behaviors (searches for information on the Internet, always defends her decisions)
• Determine customer purchasing patterns over time


Applications of Data Mining: Market Analysis and Management

Cross-market analysis
• Find associations between product sales, and predict based on such associations
• Compare sales in the US and in Italy, find associations among old products, and predict whether new ones will succeed

Customer profiling
• What types of customers buy what products
• Customers aged 20-30 with income > €20K will buy product A

Customer requirement analysis
• Identify the best products for different groups of customers
• Predict what factors will attract new customers


Applications of Data Mining: Corporate Analysis

Finance planning and asset evaluation
• Cash flow prediction and analysis
• Cross-sectional and time-series analysis (financial ratio, trend analysis)

Resource planning
• Summarize and compare the resources and spending

Competition
• Monitor competitors and market directions
• Group customers into classes and define a class-based pricing procedure
• Set pricing strategy in a highly competitive market

Other examples?


What’s next?

Data Preprocessing
• Why is it needed?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy

Data Mining techniques
• Frequent patterns, association rules
• Classification and prediction
• Cluster analysis

Visualization of the results

Summary

Are you sleeping?


Data Preprocessing


Why Data Preprocessing?

Data in the real world is dirty
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“ ”, birthdate=“31/12/2099”
• Noisy: containing errors or outliers
  • e.g., Salary=“-10”
• Inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42”, Birthday=“03/07/1997” (we are in 2007!!)
  • e.g., rating was “1, 2, 3”, now rating is “A, B, C”
  • e.g., discrepancy between duplicate records: in one copy of the data customer A has to pay €200,000; in the second copy A does not have to pay anything


Why is data dirty?

Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected and when it is analyzed
• Human/hardware/software problems

Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission

Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modifying some linked data)


Why Is Data Preprocessing Important?


Data Preprocessing: 1. Data cleaning – missing values

“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)

Fill in missing values
• Example: Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
• Ignore the record (is it always feasible?)
• Fill in the missing attributes manually
• Automatically insert a constant
• Automatically insert the mean value (relative to the record class)
• Most probable value: make some inference!
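The option of inserting the mean value relative to the record class can be sketched as follows. This is only an illustration: the function name `fill_with_class_mean`, the choice of Occupation as the class, and all records except John's are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(records, attribute, class_attribute):
    """Return copies of `records` in which a missing `attribute` (None)
    is replaced by the mean of that attribute over records of the same class."""
    known = defaultdict(list)
    for record in records:
        if record[attribute] is not None:
            known[record[class_attribute]].append(record[attribute])

    filled = []
    for record in records:
        record = dict(record)  # copy, so the input stays untouched
        same_class = known[record[class_attribute]]
        if record[attribute] is None and same_class:
            record[attribute] = mean(same_class)
        filled.append(record)
    return filled


# John's record mirrors the slide; the other records are invented.
people = [
    {"Name": "John", "Occupation": "Lawyer", "Age": 28, "Salary": None},
    {"Name": "Anna", "Occupation": "Lawyer", "Age": 35, "Salary": 60000},
    {"Name": "Marco", "Occupation": "Lawyer", "Age": 41, "Salary": 80000},
    {"Name": "Lucia", "Occupation": "Teacher", "Age": 30, "Salary": 30000},
]
cleaned = fill_with_class_mean(people, "Salary", "Occupation")
print(cleaned[0]["Salary"])
```

If no record of the same class has a known value, the sketch leaves the attribute missing rather than guessing.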


Data Preprocessing: 1. Data cleaning – binning

Handle noisy data
• Binning, clustering, regression (no details)

Binning
1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
   Bin 1: 4, 8, 9
   Bin 2: 15, 21, 21
   Bin 3: 24, 25, 26
3. Smoothing by bin means:
   Bin 1: 7, 7, 7
   Bin 2: 19, 19, 19
   Bin 3: 25, 25, 25
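The three binning steps above can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes the number of values is divisible by the number of bins, as in the price example.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into `n_bins` bins of equal size."""
    if len(values) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is divisible by n_bins")
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]  # prices from the slide
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
print(bins)      # [[4, 8, 9], [15, 21, 21], [24, 25, 26]]
print(smoothed)  # [[7, 7, 7], [19, 19, 19], [25, 25, 25]]
```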


Data Preprocessing: 1. Data cleaning – clustering

[Figure: clustered data points; points lying outside the clusters are labeled as noise]


Data Preprocessing: 2. Integration and transformation

Data integration combines data from multiple sources into a coherent store

Schema integration
• Integrate metadata from different sources
• A.cust-id ≡ B.cust-number

Entity identification problem
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different (e.g., cm vs. inch)

[Figure: data sources D1, D2, D3 integrated into D1,2,3]



Data integration can lead to redundant attributes
• Same object (A.house = B.residence)
• Derived attributes (A.annualIncome = B.salary + C.rentalIncome)

Redundant attributes can be discovered via correlation analysis
• A mathematical method for detecting the correlation between two attributes
• Correlation coefficient (Pearson’s product-moment coefficient): the larger its absolute value, the stronger the correlation between attributes
• χ² (chi-square) test
• No details on these methods here
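Since the slides skip the details, here is a minimal sketch of how Pearson's coefficient could be used to spot a redundant attribute. The function name `pearson` and the sample columns are hypothetical; the formula is the standard one (covariance divided by the product of the standard deviations).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient of two
    equally long numeric columns; the result lies in [-1, +1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical columns from two sources: the second is 12 times the first,
# so one of them carries no extra information.
monthly_salary = [2000, 3000, 4000, 5000]
annual_income = [24000, 36000, 48000, 60000]
age = [45, 23, 31, 52]

r_redundant = pearson(monthly_salary, annual_income)
r_age = pearson(monthly_salary, age)
print(round(r_redundant, 3), round(r_age, 3))
```

A coefficient close to +1 or -1 suggests that one attribute can be derived from the other; a value near 0 gives no evidence of linear redundancy.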



Aggregation
• Sum the sales of different branches (in different data sources) to compute the company sales

Generalization: concept hierarchy climbing
• From integer attribute age to classes of age (children, adult, old)

Normalization: scaled to fall within a small, specified range
• Change the range from (-∞, +∞) to [-1, +1]
• {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
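The slide does not say which normalization method produces the example values. Dividing every value by the largest absolute value (here 100) reproduces them, so the sketch below assumes that method; the function name `scale_to_unit_range` is hypothetical.

```python
def scale_to_unit_range(values):
    """Divide every value by the largest absolute value so that the
    results fall within [-1, +1]."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return [0.0 for _ in values]  # all zeros: nothing to scale
    return [v / largest for v in values]


normalized = scale_to_unit_range([-13, -6, -3, 10, 100])
print(normalized)
```

Other common choices, such as min-max or z-score normalization, would give different numbers for the same input.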


Data Preprocessing: 3. Data reduction

Data reduction
• Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Different reduction types (dimensions, numerosity, discretization)

Dimensionality: attribute subset selection
• Example with a decision tree (left branches True, right False)
• Initial attribute set: {A1, A2, A3, A4, A5, A6}
• [Figure: decision tree with test nodes A4?, A1?, and A6? and leaves Class 1 and Class 2]
• Reduced attribute set: {A1, A4, A6}


Data Preprocessing: 3. Data reduction

Dimensionality: Principal Components Analysis
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
• Works for numeric data only
• Used when the number of dimensions is large

Numerosity: clustering
• Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
• [Figure: one data set forming 2 clusters; a sparse data set leading to many clusters, which is not effective]
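To illustrate what storing only the cluster representation can mean, the sketch below reduces a cluster of points to its centroid and diameter. It assumes the clusters have already been found, and it takes the diameter to be the maximum distance between any two points of the cluster; the function name `summarize_cluster` and the points are hypothetical.

```python
from itertools import combinations
from math import dist


def summarize_cluster(points):
    """Reduce a cluster to (centroid, diameter): the coordinate-wise mean
    and the largest Euclidean distance between any two of its points."""
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    diameter = max((dist(a, b) for a, b in combinations(points, 2)), default=0.0)
    return centroid, diameter


# Hypothetical cluster of four 2-D points.
cluster = [(0, 0), (4, 0), (4, 3), (0, 3)]
centroid, diameter = summarize_cluster(cluster)
print(centroid, diameter)  # (2.0, 1.5) 5.0
```

Four points are replaced by three numbers here; the saving grows with the cluster size, which is why sparse data that yields many small clusters benefits little.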


Data Preprocessing: 3. Data reduction

Numerosity: sampling
• Obtain a small sample s to represent the whole data set N
• Problem: how to select a representative sample
• Simple random sampling is not enough: representative samples should be preserved
• Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
• [Figure: random sampling vs. stratified sampling; with random sampling, some regions of the data yield no samples]
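A minimal sketch of stratified sampling follows: records are grouped by class, and the same fraction is drawn from each group, so small classes are not lost. The function name `stratified_sample`, the rule of keeping at least one record per class, and the customer data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw about `fraction` of the records from every class (stratum),
    keeping at least one record per class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data set: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(customers, key=lambda c: c[0], fraction=0.1)

n_young = sum(1 for c in sample if c[0] == "young")
n_senior = sum(1 for c in sample if c[0] == "senior")
print(n_young, n_senior)  # 9 1
```

A plain random sample of 10 customers could easily contain no senior customer at all; the stratified sample keeps the 90/10 proportion.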


Data Preprocessing: 4. Discretization - concept hierarchy

Three types of attributes
• Nominal: values from an unordered set (color, profession)
• Ordinal: values from an ordered set (military or academic rank)
• Continuous: numbers (integer or real numbers)

Discretization
• Divide the range of a continuous attribute into intervals
• Reduces data size and complexity
• Some data mining algorithms do not support continuous types; in those cases discretization is mandatory

Some useful methods
• Binning, clustering (already presented)
• Entropy-based discretization (no details here)


Data Preprocessing: 4. Discretization - concept hierarchy

Concept hierarchy generation for categorical data
• Specification of an ordering between attributes (schema level)
  • street < city < state < country
• Specification of a hierarchy of values (data level)
  • {Urbana, Champaign, Chicago} < Illinois
• Automatic generation using the number of distinct values
  • For the set of attributes: {street, city, state, country}
  • IF: |street| = 600,000, |city| = 3,000, |state| = 300, |country| = 15
  • THEN: street < city < state < country
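The automatic generation rule above amounts to sorting the attributes by their number of distinct values, with the most distinct values at the lowest level of the hierarchy. A minimal sketch, using the counts from the slide and a hypothetical function name:

```python
def hierarchy_by_distinct_counts(distinct_counts):
    """Order attribute names from the lowest hierarchy level (most distinct
    values) to the highest level (fewest distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)


# Distinct-value counts from the slide, given in arbitrary order.
counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
hierarchy = hierarchy_by_distinct_counts(counts)
print(" < ".join(hierarchy))  # street < city < state < country
```

This is a heuristic: it works when more general concepts really have fewer distinct values, which should be checked before the generated hierarchy is trusted.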


Day 1 Summary
• Data Mining and KDD
• Why Data Mining
• Applications of Data Mining
• Data Preprocessing
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy

Tomorrow?
• Data Mining techniques
• Results visualization
• Summary

Questions?
