
10/30/02

ME
DATA MINING OVERVIEW

Margaret H. Dunham
CSE Department
Southern Methodist University
Dallas, Texas 75275
mhd@engr.smu.edu


• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?

UNCOVER HIDDEN INFORMATION

DATA MINING


Data Mining Definition

• Finding hidden information in a database
• Fit data to a model
• Similar terms
  • Exploratory data analysis
  • Data driven discovery
  • Deductive learning


Database Processing vs. Data Mining Processing

          Database Processing            Data Mining Processing
Query     Well defined; SQL              Poorly defined; no precise query language
Data      Operational data               Not operational data
Output    Precise; subset of database    Fuzzy; not a subset of database


Data Mining Development


KDD Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

Modified from [FPSS96C]


KDD Process Ex: Web Log

• Selection:
  • Select log data (dates and locations) to use
• Preprocessing:
  • Remove identifying URLs
  • Remove error logs
• Transformation:
  • Sessionize (sort and group)
• Data Mining:
  • Identify and count patterns
  • Construct data structure
• Interpretation/Evaluation:
  • Identify and display frequently accessed sequences.
• Potential User Applications:
  • Cache prediction
  • Personalization
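The Web log steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the 30-minute session gap, and the function names are all hypothetical, and the "pattern" counted here is simply a pair of consecutively requested pages.

```python
from collections import Counter

# Hypothetical log records: (user, timestamp in seconds, requested page, HTTP status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u2", 40, "/home", 200),
    ("u1", 60, "/cart", 200),
    ("u2", 70, "/products", 200),
    ("u2", 75, "/missing", 404),
    ("u1", 5000, "/home", 200),
    ("u1", 5020, "/products", 200),
]

SESSION_GAP = 1800  # assumption: more than 30 minutes of silence starts a new session


def preprocess(log):
    """Preprocessing: drop error entries (status 400 and above)."""
    return [rec for rec in log if rec[3] < 400]


def sessionize(log):
    """Transformation: sort by user and time, then group requests into sessions."""
    sessions = []
    last_seen = {}
    current = {}
    for user, ts, page, _ in sorted(log):
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            current[user] = []
            sessions.append(current[user])
        current[user].append(page)
        last_seen[user] = ts
    return sessions


def count_sequences(sessions, length=2):
    """Data mining: count every run of `length` consecutive pages."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


sessions = sessionize(preprocess(LOG))
# Interpretation/Evaluation: report the most frequently accessed sequence.
frequent = count_sequences(sessions).most_common(1)
print(frequent)
```

A cache could use the most frequent sequence to prefetch the second page whenever the first one is requested.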


Basic Data Mining Tasks

• Classification maps data into predefined groups
  • Pattern Recognition
  • Regression
• Clustering partitions database into groups
  • Groups not known a priori
  • Determined by the data (similarity)
• Link Analysis uncovers relationships among data
  • Association Rules
    • Ex: 60% of the time that bread is sold, so is peanut butter
  • Sequence Analysis
    • Ex: Most people who purchase CD players will purchase a CD within one week
  • Not causal
  • Not functional dependencies


Survey of Data Mining Tasks

• Classification
  • Decision Trees
  • Neural Networks
• Clustering
  • Agglomerative
  • Partitional
• Association Rules
• Web Mining


Classification Problem

Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.

• Actually divides D into equivalence classes.
• Prediction is similar, but may be viewed as having infinite number of classes.
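A minimal sketch of the mapping f: D → C and of the groups it induces on D. The tuples, the class names, and the height thresholds are all hypothetical and serve only to make the definition concrete.

```python
from collections import defaultdict

# Hypothetical database D of tuples (name, height in metres).
D = [("Ann", 1.55), ("Bob", 1.92), ("Cal", 1.75), ("Dee", 1.68), ("Eli", 2.01)]

# Set of classes C.
C = ["short", "medium", "tall"]


def f(t):
    """Mapping f: D -> C, using assumed height thresholds."""
    _, height = t
    if height < 1.7:
        return "short"
    if height < 1.9:
        return "medium"
    return "tall"


# Group the tuples by assigned class; each tuple lands in exactly one group.
partition = defaultdict(list)
for t in D:
    partition[f(t)].append(t[0])

print(dict(partition))
```

Because f assigns every tuple to exactly one class, the groups are disjoint and together cover D.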


Classification Examples

• Pattern matching
• Fraud detection
• Identification of plant/animal species
• Profiling (this is not a bad word)
• Predicting terrorists or potential terrorist events
• Web searches (Information Retrieval)


Defining Classes

Partitioning Based

Distance Based


Decision Trees

Decision Tree (DT):
• Tree where the root and each internal node is labeled with a question.
• The arcs represent each possible answer to the associated question.
• Each leaf node represents a prediction of a solution to the problem.

Popular technique for classification; leaf node indicates class to which the corresponding tuple belongs.
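The structure described above can be sketched as nested dictionaries: internal nodes hold a question, arcs are keyed by answer, and leaves are class labels. The attributes (gender, height), the thresholds, and the helper names are hypothetical; this is one possible encoding, not a tree induced from data.

```python
def height_band(low, high):
    """Build a question whose answer is 'below', 'between', or 'above'."""
    def question(t):
        if t["height"] < low:
            return "below"
        if t["height"] <= high:
            return "between"
        return "above"
    return question


# Internal node: dict with a question and one arc per possible answer.
# Leaf node: a plain string naming the class.
TREE = {
    "question": lambda t: t["gender"],
    "arcs": {
        "F": {"question": height_band(1.6, 1.8),
              "arcs": {"below": "short", "between": "medium", "above": "tall"}},
        "M": {"question": height_band(1.7, 1.9),
              "arcs": {"below": "short", "between": "medium", "above": "tall"}},
    },
}


def classify(node, t):
    """Follow arcs from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        answer = node["question"](t)
        node = node["arcs"][answer]
    return node


print(classify(TREE, {"gender": "F", "height": 1.85}))  # tall
print(classify(TREE, {"gender": "M", "height": 1.85}))  # medium
```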


Decision Tree Example


Neural Networks

• Based on observed functioning of human brain (Artificial Neural Networks, ANN).
• Our view of neural networks is very simplistic.
• We view a neural network (NN) from a graphical viewpoint.
• Alternatively, a NN may be viewed from the perspective of matrices.
• Used in pattern recognition, speech recognition, computer vision, and classification.


Classification Using Neural Networks

• Typical NN structure for classification:
  • One output node per class
  • Output value is class membership function value
• Supervised learning
• For each tuple in training set, propagate it through NN. Adjust weights on edges to improve future classification.
• Algorithms: Propagation, Backpropagation, Gradient Descent
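The training loop above can be illustrated with a deliberately small network: inputs connected directly to one sigmoid output node per class, trained by gradient descent on the squared error. This is a simplification, not full backpropagation, since there is no hidden layer to propagate the error through. The training tuples, learning rate, epoch count, and function names are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate(weights, x):
    """Propagation: one sigmoid output per class (the last weight is a bias)."""
    return [sigmoid(sum(w * xi for w, xi in zip(ws, x + [1.0]))) for ws in weights]


def train(data, n_classes, epochs=1000, rate=0.5, seed=0):
    """Gradient descent on squared error, one update per training tuple."""
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
               for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in data:
            outputs = propagate(weights, x)
            for k, out in enumerate(outputs):
                target = 1.0 if k == label else 0.0
                # Error term scaled by the sigmoid derivative out * (1 - out).
                delta = (target - out) * out * (1.0 - out)
                for i, xi in enumerate(x + [1.0]):
                    weights[k][i] += rate * delta * xi
    return weights


def classify(weights, x):
    """Pick the class whose output node has the largest membership value."""
    outputs = propagate(weights, x)
    return outputs.index(max(outputs))


# Hypothetical training set: (input tuple, class index).
DATA = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.1, 0.3], 0),
        ([0.9, 1.0], 1), ([1.0, 0.8], 1), ([0.8, 0.9], 1)]

weights = train(DATA, n_classes=2)
print([classify(weights, x) for x, _ in DATA])
```

With a hidden layer, the same error terms would be passed backwards through the network to adjust the earlier weights, which is the role of backpropagation.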


Neural Network Example


Propagation

Tuple Input

Output


Backpropagation

Error


Clustering Problem

Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1,…,k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.

A Cluster, Kj, contains precisely those tuples mapped to it.

Unlike classification problem, clusters are not known a priori.


Clustering Examples

• Segment customer database based on similar buying patterns.
• Group houses in a town into neighborhoods based on similar features.
• Identify new plant species
• Identify similar Web usage patterns


Agglomerative Example

     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

[Figure: graph of items A, B, C, D, E and a dendrogram over A B C D E with threshold levels 1 to 5]
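The distance matrix can be clustered at each threshold with a short script. The slide does not say which linkage is used; the sketch below assumes single link (two clusters merge when their closest members are within the threshold), and the function name is hypothetical.

```python
# Distance matrix from the slide (symmetric, items A..E).
ITEMS = ["A", "B", "C", "D", "E"]
DIST = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
]


def single_link_clusters(threshold):
    """Merge clusters until no two clusters have members within `threshold`."""
    clusters = [{i} for i in range(len(ITEMS))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                closest = min(DIST[i][j] for i in clusters[a] for j in clusters[b])
                if closest <= threshold:
                    clusters[a] |= clusters[b]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return sorted(sorted(ITEMS[i] for i in c) for c in clusters)


for threshold in range(4):
    print(threshold, single_link_clusters(threshold))
```

Under the single-link assumption, threshold 1 yields {A, B}, {C, D}, {E}; threshold 2 yields {A, B, C, D}, {E}; and threshold 3 merges everything into one cluster.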


Association Rule Problem

Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2,…,tn} where ti={Ii1,Ii2,…,Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

• Link Analysis
• NOTE: Support of X ⇒ Y is same as support of X ∪ Y.


Example: Market Basket Data

• Items frequently purchased together: Bread ⇒ PeanutButter
• Uses:
  • Placement
  • Advertising
  • Sales
  • Coupons
• Objective: increase sales and reduce costs


Association Rule Definitions

Set of items: I={I1,I2,…,Im}

Transactions: D={t1,t2,…,tn}, tj ⊆ I

Itemset: {Ii1,Ii2,…,Iik} ⊆ I

Support of an itemset: Percentage of transactions which contain that itemset.

Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.


Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} is 60%
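Support can be computed directly from its definition. The slide's transaction table did not survive extraction, so the five transactions below are an assumption, chosen so that {Bread, PeanutButter} has the stated 60% support. Confidence is taken in its usual sense, the fraction of transactions containing X that also contain Y.

```python
# Hypothetical transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter};
# these baskets are assumed, not taken from the slide.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return count(itemset) / len(TRANSACTIONS)


def confidence(x, y):
    """Confidence of X => Y: transactions with X and Y over transactions with X."""
    return count(x | y) / count(x)


print(support({"Bread", "PeanutButter"}))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}))  # 0.75
```

The support of the rule Bread ⇒ PeanutButter equals the support of the itemset {Bread, PeanutButter}, which is why only one support function is needed.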


Web Data

• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  • Profiles
  • Registration information
  • Cookies


Web Structure Mining

• Mine structure (links, graph) of the Web
• PageRank
• Create a model of the Web organization.
• May be combined with content mining to more effectively retrieve important pages.


PageRank

• Used by Google
• Prioritize pages returned from search by looking at Web structure.
• Importance of page is calculated based on number of pages which point to it (backlinks).
• Weighting is used to provide more importance to backlinks coming from important pages.
• PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
  • PR(i): PageRank for a page i which points to target page p.
  • Ni: number of links coming out of page i
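The formula can be evaluated by repeated substitution: start from equal ranks and reapply PR(p) = c (PR(1)/N1 + … + PR(n)/Nn) until the values settle. The three-page link graph is hypothetical, c is treated here as the constant that keeps the ranks summing to 1, and the sketch follows the simplified formula on the slide (no damping factor), so it assumes every page has at least one outgoing link.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / Ni) over the pages i that link to p."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, targets in links.items():
            for p in targets:
                new[p] += pr[i] / len(targets)  # contribution PR(i) / Ni
        c = 1.0 / sum(new.values())             # normalizing constant
        pr = {p: c * v for p, v in new.items()}
    return pr


ranks = pagerank(LINKS)
print({p: round(r, 3) for p, r in ranks.items()})
```

In this graph C receives backlinks from both A and B, and A receives C's only outgoing link, so A and C end up with equal rank and B with half of that.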


Web Usage Mining

• Extends work of basic search engines
• Search Engines
  • IR application
  • Keyword based
  • Similarity between query and document
  • Crawlers
  • Indexing
  • Profiles
  • Link analysis


Web Usage Mining Applications

• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)