
Database Management Systems: Data Mining


1

Jerry Post, Copyright © 2003

Database Management Systems: Data Mining

Data Compression

2


The Need for Data Compression

Noisy data: need to combine data to smooth it (bins)

Too many values in a dimension: need to combine data into a smaller number of sets

Hierarchies: roll up data into natural hierarchies; create additional hierarchies

Data compression: large text, images, time series (wavelet compression)

3


Bins

Numerical data: create bins

Equal-depth partitions, equal-width partitions, bin mean/smoothing

Choose number of bins?

Data: 3, 9, 17, 27, 32, 32, 37, 48, 55

Equal-depth partitions: Bin 1: 3, 9, 17; Bin 2: 27, 32, 32; Bin 3: 37, 48, 55

Equal-width partitions: Bin 1 (3 – 20.3): 3, 9, 17; Bin 2 (20.3 – 37.6): 27, 32, 32, 37; Bin 3 (37.6 – 55): 48, 55

Smoothing by bin means (equal-depth bins): Bin 1: 9.6, 9.6, 9.6; Bin 2: 30.3, 30.3, 30.3; Bin 3: 46.6, 46.6, 46.6
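
The slide's bins can be reproduced with a short script. A minimal sketch in plain Python (not part of the original deck); the data values and the choice of three bins come from the slide.

```python
# Sketch: equal-depth and equal-width binning, then smoothing by bin means.
data = [3, 9, 17, 27, 32, 32, 37, 48, 55]
k = 3  # number of bins

# Equal-depth (equal-frequency) partitions: the same number of values per bin.
depth = len(data) // k
equal_depth = [data[i * depth:(i + 1) * depth] for i in range(k)]

# Equal-width partitions: each bin spans the same range of values.
width = (max(data) - min(data)) / k
def width_bin(x):
    return min(int((x - min(data)) / width), k - 1)  # keep the maximum in the last bin
equal_width = [[x for x in data if width_bin(x) == i] for i in range(k)]

# Smoothing by bin means: replace each value with the mean of its equal-depth bin.
smoothed = [[round(sum(b) / len(b), 1)] * len(b) for b in equal_depth]

print(equal_depth)   # [[3, 9, 17], [27, 32, 32], [37, 48, 55]]
print(equal_width)   # [[3, 9, 17], [27, 32, 32, 37], [48, 55]]
print(smoothed)      # bin means 9.7, 30.3, 46.7 (the slide truncates to 9.6 / 46.6)
```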

4


What is Cluster Analysis?

The process of placing items into groups, where items within a group are similar to each other, and dissimilar to items in other groups.

Similar to Classification Analysis, but in classification, the group characteristics are known in advance (e.g., borrowers who successfully repaid loans).

5


Clustering

1. Find groups statistically

2. Eliminate outliers

6


Standardization

If there are multiple attributes with continuous variables, you should standardize them.

Z-scores are a common approach, but the usual standard deviation (based on squared deviations) places too much weight on outliers, so the mean absolute deviation is often used instead:

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
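
A minimal sketch of this standardization in Python; the sample values and the function name `standardize` are illustrative, not from the slides.

```python
# Sketch: standardize one attribute f using the mean absolute deviation s_f,
# which is less sensitive to outliers than the standard deviation.
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                            # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n      # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]         # z_if = (x_if - m_f) / s_f

print(standardize([12.0, 15.0, 14.0, 100.0, 13.0]))  # 100 is an illustrative outlier
```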

7


Distance Measure

A simple distance measure is the l1 norm:

D(i,j) = |xi1-xj1| + |xi2-xj2| + … + |xin-xjn|

A more common measure is the Euclidean or l2 norm: D(i,j) = sqrt((xi1-xj1)^2 + (xi2-xj2)^2 + … + (xin-xjn)^2)

[Figure: two points plotted on the x1 and x2 axes, with coordinate differences of 3 and 5 illustrating the distance calculation.]
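
A small sketch of both norms in Python; the two sample points are made up so that the coordinate differences are 3 and 5, matching the figure.

```python
import math

# Sketch: Manhattan (l1) and Euclidean (l2) distance between two points.
def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p, q = (1, 2), (4, 7)          # coordinate differences of 3 and 5
print(l1_distance(p, q))       # 8
print(l2_distance(p, q))       # sqrt(34) ~ 5.83
```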

8


General Distance Measure

In general form:

$$d(i,j) = \left(|x_{i1}-x_{j1}|^{q} + |x_{i2}-x_{j2}|^{q} + \cdots + |x_{in}-x_{jn}|^{q}\right)^{1/q}$$

and the maximum (supremum) distance:

$$\|x_i - x_j\|_{\infty} = \max_{k=1\ldots n} |x_{ik} - x_{jk}|$$
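
A sketch of the general (Minkowski) form in Python; the helper names and sample points are illustrative. Setting q = 1 gives the l1 norm, q = 2 the l2 norm, and large q approaches the max norm.

```python
# Sketch: Minkowski distance of order q, plus the max (supremum) norm.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def max_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

p, r = (1, 2, 3), (4, 0, 8)
print(minkowski(p, r, 1))    # 10.0  (l1)
print(minkowski(p, r, 2))    # ~6.16 (l2)
print(minkowski(p, r, 20))   # close to the max norm
print(max_norm(p, r))        # 5
```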

9


Hierarchies

Dates

Year – Quarter – Month – Day

Year – Quarter – Week – Day

Business hierarchies

Division/Product

Function: Marketing, HRM, Production, …

Region

Region hierarchies

World – Continent – Nation – State – City
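
One common way to roll data up such a hierarchy is a grouped aggregation. A sketch assuming pandas is available; the column names and sales figures are invented for illustration.

```python
import pandas as pd

# Sketch: roll daily sales up the Year - Quarter - Month hierarchy.
daily = pd.DataFrame({
    "Year":    [2003, 2003, 2003, 2003],
    "Quarter": [1, 1, 1, 2],
    "Month":   [1, 1, 2, 4],
    "Sales":   [120.0, 80.0, 95.0, 110.0],
})

by_month   = daily.groupby(["Year", "Quarter", "Month"])["Sales"].sum()
by_quarter = daily.groupby(["Year", "Quarter"])["Sales"].sum()
by_year    = daily.groupby(["Year"])["Sales"].sum()
print(by_month, by_quarter, by_year, sep="\n")
```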

10


Clustering: Principal Components

Find the primary factors (vectors Y1, Y2) that define the data. Statistical packages can compute them quickly. Map the raw data (x1, x2 points) onto the two vectors, and use the vectors instead of the raw data.

[Figure: x1, x2 scatter plot with the principal component vectors Y1 and Y2 drawn through the data.]
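
A minimal sketch of computing the principal component vectors with NumPy; the synthetic correlated data stands in for the x1, x2 points on the slide.

```python
import numpy as np

# Sketch: principal components via the eigenvectors of the covariance matrix,
# then project the raw points onto Y1 and Y2 and use the scores instead.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated x1, x2

centered = x - x.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalue order
order = np.argsort(eigenvalues)[::-1]             # largest variance first
components = eigenvectors[:, order]               # columns: Y1, then Y2

scores = centered @ components                    # use these instead of the raw data
print(components)
print(scores[:3])
```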

11


Factor Analysis: Latent Variables

Some data is not directly observable but can be measured through indicator variables. The classic example is human intelligence, which can be evaluated through a variety of tests.

[Diagram: latent variable Intelligence measured by the indicator variables IQ Test, SAT, and ACT.]

12


Exploratory Factor Analysis

Survey (marketing): many items (questions). Are they related? Can the data be described by a small number of factors?

[Diagram: Factor 1 loading on questions Q1, Q2, Q3; Factor 2 loading on questions Q4, Q5, Q6.]
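
A sketch using scikit-learn's FactorAnalysis on simulated survey responses; the six questions and the two-factor structure are invented to mirror the diagram.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Sketch: do six survey questions reduce to two underlying factors?
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(2, 500))                 # two hidden factors
noise = 0.3 * rng.normal(size=(500, 6))
# Simulated responses: Q1-Q3 load on factor 1, Q4-Q6 on factor 2.
responses = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

fa = FactorAnalysis(n_components=2).fit(responses)
print(fa.components_)      # loadings: rows are factors, columns are Q1..Q6
```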

13


Regression to Reduce Values

Estimate regression coefficients. If the fit is good, use the coefficients to pick categories.
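
A sketch of the idea with NumPy; the data, the 0.9 fit threshold, and the category boundaries are illustrative.

```python
import numpy as np

# Sketch: fit y = b0 + b1*x, then bucket observations by their fitted values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

b1, b0 = np.polyfit(x, y, deg=1)       # polyfit returns the highest power first
fitted = b0 + b1 * x
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

if r2 > 0.9:                           # good fit: use fitted values to pick categories
    categories = np.digitize(fitted, bins=[5.0, 10.0, 15.0])
    print(b0, b1, r2, categories)
```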

14


Wavelet Compression

Find patterns (wavelets) in the data, as in standard compression methods (GIF). Reduces the data to a small number of wavelet patterns. Particularly useful for complex data.
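
A sketch of wavelet compression of a time series, assuming the PyWavelets (pywt) package is available; the series, the wavelet choice (db4), and the threshold are illustrative.

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

# Sketch: decompose a series into wavelet coefficients, keep only the large
# ones, and reconstruct -- most coefficients can be dropped with little error.
t = np.linspace(0, 1, 256)
series = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.default_rng(2).normal(size=256)

coeffs = pywt.wavedec(series, "db4", level=4)
compressed = [pywt.threshold(c, value=0.2, mode="hard") for c in coeffs]
restored = pywt.waverec(compressed, "db4")

kept = sum(int(np.count_nonzero(c)) for c in compressed)
print(kept, "nonzero coefficients kept for", len(series), "data points")
```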

15


Clustering Technologies

K-means: the user specifies the number of clusters (k); the system minimizes the intracluster distance (usually L2: the sum of squared errors).

K-medoids (selects a representative center point)
Hierarchical (BIRCH)
Non-spherical (CURE)
Density-based (DBSCAN, DENCLUE [best of group])
Grid-based (STING, CLIQUE)

Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, 2001, Chapter 8.
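
A sketch of k-means with scikit-learn: the user supplies k, and inertia_ reports the intracluster sum of squared errors. The three synthetic clusters are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch: k-means with a user-chosen k; inertia_ is the sum of squared
# distances from each point to its cluster center (the L2 criterion).
rng = np.random.default_rng(3)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                    for c in ((0, 0), (5, 5), (0, 5))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)
print(km.inertia_)             # intracluster sum of squared errors
print(km.labels_[:10])
```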

16


Classification

Classify an outcome based on attributes. Similar to prediction and attribute valuation, but generally splits attributes into groups.

Example: Decision Tree. Will customers buy a new item?

[Decision tree: Age is the root node.
Age <= 25 -> Student? (No -> No, Yes -> Yes)
Age 26…50 -> Yes
Age > 50 -> Income (Low -> No, High -> Yes)]

17


Decision Tree

Modeler specifies: the dependent variable (outcome), the attribute variables, and sample/training data.

The system selects the nodes and the splitting criteria to best fit the model, possibly using an information-gain criterion.

Results are useful in data mining/OLAP/cube browser. The tree nodes become important sides of the cube, and the classification levels specify hierarchies.
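
A sketch of fitting such a tree with scikit-learn's entropy (information-gain) criterion; the tiny training table is invented to mirror the age/student/income example above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Sketch: a small decision tree on invented training data.
# Columns: age, student (0/1), income (0 = low, 1 = high).
X = [
    [22, 0, 1], [24, 1, 0], [30, 0, 1], [45, 1, 1],
    [52, 0, 0], [60, 1, 1], [23, 1, 1], [55, 0, 1],
]
y = ["no", "yes", "yes", "yes", "no", "yes", "yes", "yes"]  # will they buy?

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "student", "income"]))
print(tree.predict([[26, 0, 0]]))
```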

18


Bayesian Classifier

Goal: predict the probability that an item belongs to a class. Begin with the naïve assumption: with m classes, the a priori probability P(Ci) = 1/m.

Bayes’ Theorem: The probability that sample X belongs in Class i is P(Ci|X).

Assume the attributes are independent, so you can multiply the probabilities. Estimate the attribute probabilities from the sample. For categorical data, sik is the number of training samples of Class Ci having the value xk for Ak and si is the number of training samples in Ci. For continuous variables, use a Gaussian (normal) distribution.

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

$$P(x_k \mid C_i) = \frac{s_{ik}}{s_i}$$
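
A sketch of this computation for categorical attributes, estimating P(xk|Ci) = sik/si from counts; the tiny training sample and class labels are invented.

```python
from collections import Counter, defaultdict

# Sketch: naive Bayes for categorical attributes.
# P(Ci|X) is proportional to P(Ci) * product_k P(x_k|Ci),
# with P(x_k|Ci) = s_ik / s_i estimated from the training sample.
train = [
    (("<=25", "yes"), "buy"),
    (("26-50", "no"), "buy"),
    (("26-50", "yes"), "buy"),
    ((">50", "no"), "no_buy"),
    (("<=25", "no"), "no_buy"),
]

class_counts = Counter(label for _, label in train)
value_counts = defaultdict(Counter)        # (class, attribute index) -> value counts
for attrs, label in train:
    for k, value in enumerate(attrs):
        value_counts[(label, k)][value] += 1

def posterior(x):
    scores = {}
    for ci, s_i in class_counts.items():
        p = s_i / len(train)                          # prior P(Ci) from the sample
        for k, value in enumerate(x):
            p *= value_counts[(ci, k)][value] / s_i   # P(x_k|Ci) = s_ik / s_i
        scores[ci] = p
    total = sum(scores.values()) or 1.0
    return {ci: p / total for ci, p in scores.items()}  # normalize by P(X)

print(posterior(("<=25", "yes")))
```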

19


Neural Networks

[Figure: neural network with sensory input cells, a hidden layer, and output cells; input weights (e.g., 3, -2, 7, 4) label some of the connections. The network can still respond to an incomplete pattern / missing inputs.]

20


Neural Networks: Back Propagation

Training sample with known results. Define the network:

Number of layers (you must have one hidden layer); number of cells in each layer; normalize the data between 0.0 and 1.0.

Initialize the weights (small random numbers). Propagate forward and compute the error. Update the weights and biases working backwards.

[Diagram: a single unit j with inputs X0, X1, …, Xn, weights w0j, w1j, …, wnj, a weighted sum with bias θj, and an activation function f producing the output.]

$$I_j = \sum_i w_{ij}\,O_i + \theta_j$$

$$O_j = \frac{1}{1 + e^{-I_j}}$$
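
A sketch of the forward computation for one unit j, following the two equations above; the inputs, weights, and bias are arbitrary.

```python
import math

# Sketch: one unit j -- weighted sum of inputs plus bias, then the logistic
# activation: I_j = sum_i w_ij * O_i + theta_j, and O_j = 1 / (1 + e^-I_j).
def unit_output(inputs, weights, bias):
    i_j = sum(w * o for w, o in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-i_j))

print(unit_output([0.2, 0.9, 0.5], [0.3, -0.2, 0.7], bias=0.1))
```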

21


NN: Error Calculations

For the output layer:

$$\mathrm{Err}_j = O_j\,(1 - O_j)\,(T_j - O_j)$$

Tj is the true output; Oj(1-Oj) is the derivative of the logistic function.

For the hidden layer:

$$\mathrm{Err}_j = O_j\,(1 - O_j) \sum_k \mathrm{Err}_k\, w_{jk}$$

The weight and bias updates use the learning rate (l):

$$\Delta w_{ij} = (l)\,\mathrm{Err}_j\,O_i, \qquad \Delta\theta_j = (l)\,\mathrm{Err}_j$$
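
A sketch of one backpropagation step for a tiny 2-2-1 network, applying the error terms and updates above; the learning rate, training pair, and initial weights are illustrative.

```python
import math, random

# Sketch: one training step for a 2-input, 2-hidden, 1-output network
# using the error terms and weight/bias updates on this slide.
random.seed(0)
l = 0.5                                               # learning rate (l)
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]  # input->hidden
w_ho = [random.uniform(-0.5, 0.5) for _ in range(2)]                      # hidden->output
theta_h = [0.0, 0.0]
theta_o = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, target = [0.1, 0.9], 1.0

# Forward pass.
o_h = [sigmoid(sum(w_ih[j][i] * x[i] for i in range(2)) + theta_h[j]) for j in range(2)]
o_out = sigmoid(sum(w_ho[j] * o_h[j] for j in range(2)) + theta_o)

# Output-layer error: Err_j = O_j (1 - O_j)(T_j - O_j).
err_out = o_out * (1 - o_out) * (target - o_out)
# Hidden-layer error: Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk.
err_h = [o_h[j] * (1 - o_h[j]) * err_out * w_ho[j] for j in range(2)]

# Updates: delta_w_ij = (l) Err_j O_i and delta_theta_j = (l) Err_j.
for j in range(2):
    w_ho[j] += l * err_out * o_h[j]
    theta_h[j] += l * err_h[j]
    for i in range(2):
        w_ih[j][i] += l * err_h[j] * x[i]
theta_o += l * err_out

print(o_out, err_out, err_h)
```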