
Page 1: Data Mining Techniques for Query Relaxation

Data Mining Techniques for Query Relaxation

Page 2: Data Mining Techniques for Query Relaxation

2

Query Relaxation via Abstraction

Abstraction is context dependent:

6’9” guard → big guard
6’9” forward → medium forward
6’9” center → small center

Heights of guards: small (< 6’), medium (6’ to 6’4”), large (> 6’4”)

A conceptual query: find me a big guard.

Abstraction must be automated for
• Large domains
• Unfamiliar domains

Page 3: Data Mining Techniques for Query Relaxation

3

Related Work

Maximum Entropy (ME) method:
• Maximizes entropy (−Σ p log p)
• Considers only the frequency distribution

Conceptual clustering systems:
• Allow only non-numerical values (COBWEB)
• Assume a certain distribution (CLASSIT)

Page 4: Data Mining Techniques for Query Relaxation

4

Supervised vs. Unsupervised Learning

Supervised learning: given instances with known class information, generate rules or a decision tree that can be used to infer the class of future instances.

Examples: ID3, statistical pattern recognition

Unsupervised learning: given instances with unknown class information, generate a concept tree that clusters instances into similar classes.

Examples: COBWEB, TAH generation (DISC, PKI)

Page 5: Data Mining Techniques for Query Relaxation

5

Automatic Construction of TAHs

Necessary for scaling up CoBase.

Sources of knowledge:
• Database instance
  - Attribute value distributions
  - Inter-attribute relationships
• Query and answer statistics
• Domain expert

Approach:
• Generate an initial TAH with minimal expert effort
• Edit the hierarchy to suit
  - Application context
  - User profile

Page 6: Data Mining Techniques for Query Relaxation

For Clustering Attribute Instances with Non-Numerical Values

Page 7: Data Mining Techniques for Query Relaxation

7

Pattern-Based Knowledge Induction (PKI)

Rule-Based

Clusters attribute values into a TAH based on the other attributes in the relation.

Provides an attribute correlation value.

Page 8: Data Mining Techniques for Query Relaxation

8

Definitions

The cardinality of a pattern P, denoted |P|, is the number of distinct objects that match P.

The confidence of a rule A → B, denoted conf(A → B), is

conf(A → B) = |A ∧ B| / |A|

Let A → B be a rule that applies to a relation R. The support of the rule over R is defined as

support(A → B) = |A| / |R|
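These definitions translate directly into code. A minimal Python sketch, assuming a relation is a list of dicts and a pattern is a dict of attribute=value conditions (this representation and the helper names are illustrative, not from the slides); the example table is the one used in the PKI example later in the deck:

```python
def matches(row, pattern):
    """True if the row satisfies every attribute=value condition in `pattern`."""
    return all(row.get(k) == v for k, v in pattern.items())

def confidence(relation, premise, consequence):
    """conf(A -> B) = |A and B| / |A|."""
    a = [r for r in relation if matches(r, premise)]
    return sum(matches(r, consequence) for r in a) / len(a) if a else 0.0

def support(relation, premise):
    """support(A -> B) = |A| / |R| (depends only on the premise A)."""
    return sum(matches(r, premise) for r in relation) / len(relation)

R = [{"A": "a1", "B": "b1", "C": "c1"},
     {"A": "a1", "B": "b2", "C": "c1"},
     {"A": "a2", "B": "b1", "C": "c1"},
     {"A": "a3", "B": "b2", "C": "c1"}]
print(confidence(R, {"A": "a1"}, {"B": "b1"}))  # 0.5
print(support(R, {"A": "a1"}))                  # 0.5
```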

Page 9: Data Mining Techniques for Query Relaxation

9

Knowledge Inference: A Three-Step Process

Step 1: Infer Rules
Consider all rules of the basic form A → B.
Calculate confidence and support.

Confidence measures how well a rule applies to the database.
A → B having a confidence of 0.75 means that if A holds, B has a 75% chance of holding as well.

Support measures how often a rule applies to the database.
A → B having a support of 10 means that it applies to 10 tuples in the database (A holds for 10 tuples).

Page 10: Data Mining Techniques for Query Relaxation

10

Knowledge Inference (cont’d)

Step 2: Combine Rules
If two rules share a consequence and have the same attribute as a premise (with different values), then those values are candidates for clustering.

Color = red → style = “sport”   (confidence γ1)
Color = black → style = “sport”   (confidence γ2)

suggests that red and black should be clustered.

The correlation is the product of the confidences of the two rules:

γ = γ1 × γ2

Page 11: Data Mining Techniques for Query Relaxation

11

Clustering

Algorithm: Binary Cluster (greedy algorithm; see the sketch below)

repeat
    INDUCE RULES and determine correlations γ; sort them in descending order
    for each pair (ai, aj)
        if ai and aj are unclustered
            replace ai and aj in the DB with the joint value Jij
until fully clustered

Approximate n-ary clustering using binary clustering:
• cluster a set of n values if the correlation γ between all pairs is above a threshold
• decrease the threshold and repeat
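A Python sketch of the greedy loop above. `induce_correlations` is a hypothetical stand-in for PKI's rule-induction step (slides 8-12); the merging logic follows the pseudocode:

```python
def binary_cluster(relation, attr, induce_correlations):
    """Greedy binary clustering of the values of `attr`: merge the most
    correlated unclustered pairs into joint values, re-induce rules on the
    updated DB, and repeat until one cluster remains. `induce_correlations`
    returns a dict {(ai, aj): correlation} over current value pairs."""
    while len({row[attr] for row in relation}) > 1:
        gamma = induce_correlations(relation, attr)
        if not gamma:
            break
        clustered = set()
        # visit candidate pairs in descending correlation order
        for ai, aj in sorted(gamma, key=gamma.get, reverse=True):
            if ai not in clustered and aj not in clustered:
                joint = f"({ai}+{aj})"          # the joint value J_ij
                for row in relation:
                    if row[attr] in (ai, aj):   # replace both values in the DB
                        row[attr] = joint
                clustered.update((ai, aj))
    return relation
```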

Page 12: Data Mining Techniques for Query Relaxation

12

Knowledge Inference (cont’d)

Step 3: Combine Correlations
The clustering correlation between two values is the weighted sum of their rule correlations.

It combines all the evidence that two values should be clustered together into a single number, cor(a1, a2):

cor(a1, a2) = (1/m) Σ (i = 1 to m) wi × conf(A = a1 → Bi = bi) × conf(A = a2 → Bi = bi)

where a1 and a2 are values of attribute A, B1, …, Bm are the other m attributes of the relation with corresponding weights w1, …, wm, and for each Bi the confidence products are summed over the values bi shared by rules for a1 and a2.

Page 13: Data Mining Techniques for Query Relaxation

13

Pattern-Based Knowledge Induction (Example)

A    B    C
a1   b1   c1
a1   b2   c1
a2   b1   c1
a3   b2   c1

Rules:
A = a1 → B = b1   confidence = 0.5
A = a2 → B = b1   confidence = 1.0
A = a1 → C = c1   confidence = 1.0
A = a2 → C = c1   confidence = 1.0

correlation(a1, a2) = (0.5 × 1.0 + 1.0 × 1.0) / 2 = 0.75
correlation(a1, a3) = 0.75
correlation(a2, a3) = 0.5

1st iteration: a1 and a2 have the highest correlation and are clustered into the joint value a12 (a sketch that reproduces these numbers follows).
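These correlations can be reproduced mechanically. A self-contained Python sketch, assuming equal weights and the convention that, for each other attribute, confidence products are summed over the values shared by both rules (consistent with the worked numbers):

```python
def conf(relation, attr, v, b_attr, b_val):
    """conf(attr=v -> b_attr=b_val) over a list-of-dicts relation."""
    a = [r for r in relation if r[attr] == v]
    return sum(r[b_attr] == b_val for r in a) / len(a) if a else 0.0

def correlation(relation, attr, v1, v2):
    """cor(v1, v2): for each other attribute B, sum the products
    conf(attr=v1 -> B=b) * conf(attr=v2 -> B=b) over shared values b,
    then average over the other attributes (equal weights)."""
    others = [a for a in relation[0] if a != attr]
    total = 0.0
    for b in others:
        for b_val in {row[b] for row in relation}:
            total += (conf(relation, attr, v1, b, b_val)
                      * conf(relation, attr, v2, b, b_val))
    return total / len(others)

R = [{"A": "a1", "B": "b1", "C": "c1"},
     {"A": "a1", "B": "b2", "C": "c1"},
     {"A": "a2", "B": "b1", "C": "c1"},
     {"A": "a3", "B": "b2", "C": "c1"}]
print(correlation(R, "A", "a1", "a2"))  # 0.75
print(correlation(R, "A", "a1", "a3"))  # 0.75
print(correlation(R, "A", "a2", "a3"))  # 0.5
```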

Page 14: Data Mining Techniques for Query Relaxation

14

Pattern-Based Knowledge Induction (cont’d)

A     B    C
a12   b1   c1
a12   b2   c1
a12   b1   c1
a3    b2   c1

A = a12 → B = b2   confidence = 0.33
A = a3 → B = b2    confidence = 1.0
A = a12 → C = c1   confidence = 1.0
A = a3 → C = c1    confidence = 1.0

correlation(a12, a3) = (0.33 × 1.0 + 1.0 × 1.0) / 2 = 0.67

2nd iteration: a12 and a3 are clustered. The resulting TAH joins a1 and a2 (correlation 0.75) and then adds a3 (correlation 0.67).

Page 15: Data Mining Techniques for Query Relaxation

15

Example for Non-Numerical Attribute Values: The PEOPLE Relation

Page 16: Data Mining Techniques for Query Relaxation

16

TAH for People

Page 17: Data Mining Techniques for Query Relaxation

17

cor(a12, a3) is computed as follows:
• Attribute ORIGIN: same (Holland) → contributes 1.0
• Attribute HAIR: same → contributes 1.0
• Attribute EYE: different → contributes 0.0
• Attribute HEIGHT: overlap on MEDIUM (5/10 of a12 and 2/2 of a3) → contributes 5/10 × 2/2 = 0.5

cor(a12, a3) = 1/4 × (1 + 1 + 0 + 0.5) = 0.63

Page 18: Data Mining Techniques for Query Relaxation

18

Correlation Computation

Compute the correlation between EUROPEAN and ASIAN.

• Attributes ORIGIN and HAIR COLOR: no overlap between Europe and Asia, so no contribution to the correlation.
• Attribute EYE COLOR: BROWN is the only value with overlap; 1 out of 24 Europeans and 12 out of 12 Asians have BROWN, contributing 1/24 × 12/12 = 0.0416.
• Attribute HEIGHT: SHORT: 5/24 of Europeans and 8/12 of Asians; MEDIUM: 11/24 and 3/12; TALL: 8/24 and 1/12. HEIGHT contributes 5/24 × 8/12 + 11/24 × 3/12 + 8/24 × 1/12 = 0.2812.

Total contribution = 0.0416 + 0.2812 = 0.3228
Correlation = 1/4 × 0.3228 = 0.0807

Page 19: Data Mining Techniques for Query Relaxation

19

Extensions

• Pre-clustering: for non-discrete domains; reduces computational complexity.

• Expert direction: identify complex rules; eliminate unrelated attributes.

• Eliminating low-popularity rules: set a popularity threshold q and do not keep rules below q. Saves time and space, but loses knowledge about uncommon data. In the transportation example, q = 2 improves efficiency by nearly 80%.

• Statistical sampling for very large domains.

Page 20: Data Mining Techniques for Query Relaxation

Clustering of Attribute Instances with Numerical Values

Page 21: Data Mining Techniques for Query Relaxation

21

Conventional Clustering Methods: I. Maximum Entropy (ME)

Maximizes entropy (−Σ p log p) and considers only the frequency distribution.

Example: {1,1,2,99,99,100} and {1,1,2,3,100,100} have the same entropy (frequencies 2/6, 1/6, 2/6, 1/6).

ME therefore cannot distinguish between
(1) {1,1,2}, {99,99,100}: a good partition
(2) {1,1,2}, {3,100,100}: a bad partition

ME does not consider the value distribution, and its clusters have no semantic meaning.

Page 22: Data Mining Techniques for Query Relaxation

22

Conventional Clustering Methods:II. Biggest Gap (BG)

Consider only value distributionFind cuts at biggest gaps{1,1,1,10,10,20} is partitioned to{1,1,1,10,10} and {20} bad

A good partition:{1,1,1} and {10,10,20}

Page 23: Data Mining Techniques for Query Relaxation

23

New Notion of “Goodness” of Clusters – Relaxation Error

Page 24: Data Mining Techniques for Query Relaxation

24

Relaxation Error of a Cluster

Page 25: Data Mining Techniques for Query Relaxation

25

Relaxation Error of a Partition

Page 26: Data Mining Techniques for Query Relaxation

26

Distribution Sensitive Clustering (DISC) Example

Cluster A is partitioned into sub-clusters B = {1, 2, 3} and C = {4, 5}.

For B, each value contributes its probability (1/3) times its average distance to the members of B:

value 1: (1/3) × (0 + 1 + 2)/3 = 3/9
value 2: (1/3) × (1 + 0 + 1)/3 = 2/9
value 3: (1/3) × (2 + 1 + 0)/3 = 3/9

Page 27: Data Mining Techniques for Query Relaxation

27

Relaxation Error:

RE(B) = average pairwise difference = 3/9 + 2/9 + 3/9 = 8/9 ≈ 0.89

RE(C) = 0.5

RE(A) = 2.08

correlation(B) = 1 − RE(B)/RE(A) = 1 − 0.89/2.08 = 0.57

correlation(C) = 1 − RE(C)/RE(A) = 1 − 0.5/2.08 = 0.76

correlation(A) = 1 − RE(A)/RE(A) = 0

Page 28: Data Mining Techniques for Query Relaxation

28

Examples

Example 1: {1,1,2,3,100,100}
ME: {1,1,2}, {3,100,100}
RE({1,1,2}) = (0+1+0+1+1+1)/9 = 0.44
RE({3,100,100}) = 388/9 = 43.11
RE({1,1,2},{3,100,100}) = 0.44 × 3/6 + 43.11 × 3/6 = 21.78
Ours (DISC): RE({1,1,2,3},{100,100}) = 0.58

Example 2: {1,1,1,10,10,20}
BG: {1,1,1,10,10}, {20}
RE({1,1,1,10,10},{20}) = 3.6
Ours (DISC): RE({1,1,1},{10,10,20}) = 2.22
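Both examples are easy to check in code. A minimal sketch, assuming the RE of a cluster is the average absolute difference over all ordered value pairs and the RE of a partition weights each sub-cluster's RE by its relative size (this reproduces the numbers above):

```python
def re(c):
    """RE of a cluster: average absolute difference over all ordered pairs."""
    return sum(abs(x - y) for x in c for y in c) / len(c) ** 2

def partition_re(clusters):
    """RE of a partition: sub-cluster REs weighted by their relative sizes."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * re(c) for c in clusters)

print(round(re([1, 1, 2]), 2))                             # 0.44
print(round(partition_re([[1, 1, 2], [3, 100, 100]]), 2))  # 21.78 (ME's partition)
print(round(partition_re([[1, 1, 2, 3], [100, 100]]), 2))  # 0.58  (DISC's partition)
print(round(partition_re([[1, 1, 1], [10, 10, 20]]), 2))   # 2.22  (Example 2)
```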

Page 29: Data Mining Techniques for Query Relaxation

29

An Example

Example:

The table SHIPS has 153 tuples and the attribute LENGTH has 33 distinct values ranging from 273 to 947. DISC and ME are used to cluster LENGTH into three sub-concepts: SHORT, MEDIUM, and LONG.

Page 30: Data Mining Techniques for Query Relaxation

30

An Example (cont’d)

Cuts by DISC: between 636 and 652, and between 756 and 791; average gap = 25.5

Cuts by ME: between 540 and 560, and between 681 and 685 (a bad cut); average gap = 12

Optimal cuts by exhaustive search: between 605 and 635, and between 756 and 791; average gap = 32.5

DISC is more effective than ME in discovering relevant concepts in the data.

Page 31: Data Mining Techniques for Query Relaxation

31

An Example

[Figure: clustering of SHIP.LENGTH, showing the cuts chosen by DISC (dashed) and by ME (dash-dotted)]

Page 32: Data Mining Techniques for Query Relaxation

32

Quality of Approximate Answers

Page 33: Data Mining Techniques for Query Relaxation

33

DISC

For numeric domainsUses intra-attribute knowledge

Sensitive to both frequency and value distributions of data.

RE = average difference between exact and approximate answers in a cluster.

Quality of approximate answers are measured by relaxation error (RE): the smaller the RE, the better the approximate answer.

DISC (Distribution Sensitive Clustering) generates AAHs based on minimization of RE.

Page 34: Data Mining Techniques for Query Relaxation

34

DISC

Goal: automatic generation of TAH for a numerical attribute

Task: given a numerical attribute and a number s, find the “optimal” s-1 cuts that partition the attribute into s sub-clusters

Need a measure for optimality of clustering.

Page 35: Data Mining Techniques for Query Relaxation

35

Quality of Partitions

If RE(C) is too big, we can partition C into smaller clusters.

The goodness measure for partitioning C into m sub-clusters {C1, …, Cm} is the relaxation error reduction per sub-cluster, the category utility (CU):

CU = [ RE(C) − Σ (k = 1 to m) P(Ck) × RE(Ck) ] / m

Partition C into C1, …, Cm so as to maximize the RE reduction; sub-clusters with large RE can be partitioned further.

For efficiency, use binary partitions to obtain m-ary partitions.
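A direct transcription of the CU formula, reusing the `re` and `partition_re` helpers from the relaxation-error sketch above (P(Ck) is the relative size |Ck|/|C|):

```python
def cu(cluster, subs):
    """Category utility: CU = (RE(C) - sum_k P(Ck) * RE(Ck)) / m."""
    return (re(cluster) - partition_re(subs)) / len(subs)

# CU of splitting {1,1,2,3,100,100} into {1,1,2,3} and {100,100}:
print(round(cu([1, 1, 2, 3, 100, 100], [[1, 1, 2, 3], [100, 100]]), 2))  # 21.74
```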

Page 36: Data Mining Techniques for Query Relaxation

36

The Algorithms DISC and BinaryCut

Algorithm DISC(C)
    if the number of distinct values in C < T, return   /* T is a threshold */
    let cut = the best cut returned by BinaryCut(C)
    partition the values in C based on cut
    let the resultant sub-clusters be C1 and C2
    call DISC(C1) and DISC(C2)

Algorithm BinaryCut(C)
    /* input cluster C = {x1, …, xn} */
    for h = 1 to n − 1   /* evaluate each cut */
        let P be the partition with clusters C1 = {x1, …, xh} and C2 = {xh+1, …, xn}
        compute the category utility CU for P
        if CU > MaxCU then   /* the best cut maximizes CU */
            MaxCU = CU, cut = h
    return cut as the best cut
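A runnable sketch of the two algorithms, reusing `cu` from the category-utility sketch above; the cut is chosen to maximize CU, consistent with M-DISC later in the deck:

```python
def binary_cut(values):
    """Return the cut index h (split before position h) that maximizes CU."""
    return max(range(1, len(values)),
               key=lambda h: cu(values, [values[:h], values[h:]]))

def disc(values, threshold=4):
    """DISC: recursively apply the best binary cut until a cluster has
    fewer than `threshold` distinct values; returns a nested-list TAH."""
    values = sorted(values)
    if len(set(values)) < threshold:
        return values
    h = binary_cut(values)
    return [disc(values[:h], threshold), disc(values[h:], threshold)]

print(disc([1, 1, 2, 3, 100, 100]))  # [[1, 1, 2, 3], [100, 100]], the good partition
```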

Page 37: Data Mining Techniques for Query Relaxation

37

The N-ary Partition Algorithm

Algorithm N-aryPartition(C)
    let C1 and C2 be the two sub-clusters of C
    compute CU for the partition {C1, C2}
    for N = 2 to n − 1
        let Ci be the sub-cluster of C with maximum relaxation error
        call BinaryCut to find the best sub-clusters Ci1 and Ci2 of Ci
        compute and store CU for the partition C1, …, Ci−1, Ci1, Ci2, Ci+1, …, CN
        if the current CU is less than the previous CU, stop
        else replace Ci by Ci1 and Ci2
    /* the result is an N-ary partition of C */
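A sketch of the same loop, reusing `re`, `cu`, and `binary_cut` from the sketches above; the bookkeeping is simplified (cluster order is not preserved):

```python
def n_ary_partition(values):
    """Start from the best binary cut, then keep splitting the sub-cluster
    with maximum RE at its best binary cut, stopping as soon as CU drops."""
    values = sorted(values)
    h = binary_cut(values)
    parts = [values[:h], values[h:]]
    best = cu(values, parts)
    while True:
        worst = max(parts, key=re)               # sub-cluster with max RE
        if len(set(worst)) < 2:                  # cannot split further
            return parts
        k = binary_cut(worst)
        trial = [p for p in parts if p is not worst] + [worst[:k], worst[k:]]
        if cu(values, trial) < best:             # CU decreased: stop
            return parts
        parts, best = trial, cu(values, trial)

print(n_ary_partition([1, 1, 2, 3, 100, 100]))   # [[1, 1, 2, 3], [100, 100]]
```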

Page 38: Data Mining Techniques for Query Relaxation

38

Using TAHs for Approximate Query Answering

select CARGO-ID

from CARGOS

where SQUARE-FEET = 300

and WEIGHT = 740

no answers

The query is relaxed according to TAHs.

Page 39: Data Mining Techniques for Query Relaxation

39

Approximate Query Answering

select CARGO-ID

from CARGOS

where 294 < SQUARE-FEET < 300

and 737 < WEIGHT < 741

CARGO-ID   SQUARE-FEET   WEIGHT
10         296           740

Relaxation error = (4/11.95+0)/2 = 0.168

Further Relaxation:

select CARGO-ID

from CARGOS

where 294 < SQUARE-FEET < 306

and 737 < WEIGHT < 749

CARGO-ID   SQUARE-FEET   WEIGHT
10         296           740
21         301           737
30         304           746
44         306           745

Relaxation error = (3.75/11.95+3.5/9.88)/2 = 0.334
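The reported relaxation errors can be reproduced under one assumption: that 11.95 and 9.88 are per-attribute normalization constants for SQUARE-FEET and WEIGHT (e.g., the REs of the relaxed clusters; the slides do not say). With that reading, the RE of an answer set is the average over attributes of the mean distance to the query value, normalized:

```python
def answer_re(query, answers, norms):
    """Average over attributes of (mean |query - answer|) / norm."""
    total = 0.0
    for attr, q in query.items():
        mean_dist = sum(abs(q - a[attr]) for a in answers) / len(answers)
        total += mean_dist / norms[attr]
    return total / len(query)

query = {"SQUARE-FEET": 300, "WEIGHT": 740}
norms = {"SQUARE-FEET": 11.95, "WEIGHT": 9.88}   # assumed normalization constants
first = [{"SQUARE-FEET": 296, "WEIGHT": 740}]
more = first + [{"SQUARE-FEET": 301, "WEIGHT": 737},
                {"SQUARE-FEET": 304, "WEIGHT": 746},
                {"SQUARE-FEET": 306, "WEIGHT": 745}]
print(round(answer_re(query, first, norms), 3))  # 0.167 (slide reports 0.168)
print(round(answer_re(query, more, norms), 3))   # 0.334
```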

Page 40: Data Mining Techniques for Query Relaxation

40

Performance of DISC

Theorem: Let D and M be the optimal binary cuts by DISC and ME respectively. If the data distribution is symmetrical with respect to the median, then D = M (i.e., the cuts determined by DISC and ME are the same).

For skewed distributions, clusters discovered by DISC have less relaxation error than those by the ME method.

The more skewed the data, the greater the performance difference between DISC and ME.

Page 41: Data Mining Techniques for Query Relaxation

41

Multi-Attribute TAH (MTAH)

In many applications, concepts need to be characterized by multiple attributes, e.g., the nearness of geographical locations.

Uses of an MTAH:
• As guidance for query modification
• As a “semantic index”

Page 42: Data Mining Techniques for Query Relaxation

42

Multi-Attribute TAH (MTAH)

Page 43: Data Mining Techniques for Query Relaxation

43

Multi-Attribute DISC (M-DISC) Algorithm

Algorithm M-DISC(C)
    if the number of objects in C < T, return   /* T is a threshold */
    for each attribute a = 1 to m
        for each possible binary cut h
            compute CU for h
            if CU > MaxCU then   /* remember the best cut */
                MaxCU = CU, BestAttribute = a, cut = h
    partition C based on the cut of the attribute BestAttribute
    let the resultant sub-clusters be C1 and C2
    call M-DISC(C1) and M-DISC(C2)
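A sketch of M-DISC, reusing `cu` from the category-utility sketch above. How CU is aggregated across attributes is not spelled out on the slide; summing the per-attribute RE reductions is an assumption of this sketch:

```python
def m_disc(objects, attrs, threshold=4):
    """Evaluate every binary cut on every attribute, keep the
    (attribute, cut) pair with maximum aggregated CU, and recurse.
    `objects` is a list of dicts; `attrs` names their numeric attributes."""
    if len(objects) < threshold:
        return objects
    best = None                                    # (gain, sorted objects, cut)
    for a in attrs:
        objs = sorted(objects, key=lambda o: o[a])
        cols = {x: [o[x] for o in objs] for x in attrs}
        for h in range(1, len(objs)):
            # aggregate CU: sum per-attribute RE reductions for this cut
            gain = sum(cu(cols[x], [cols[x][:h], cols[x][h:]]) for x in attrs)
            if best is None or gain > best[0]:
                best = (gain, objs, h)
    _, objs, h = best
    return [m_disc(objs[:h], attrs, threshold), m_disc(objs[h:], attrs, threshold)]

# e.g. m_disc([{"LAT": 34.1, "LON": -118.4}, ...], ["LAT", "LON"])
```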

Page 44: Data Mining Techniques for Query Relaxation

44

Greedy M-DISC Algorithm: gM-DISC

Algorithm gM-DISC(C)
    if the number of objects in C < T, return   /* T is a threshold */
    for each attribute a = 1 to m
        for each possible binary cut h
            compute REa for h
            if REa > MaxRE then   /* remember the best cut */
                MaxRE = REa, BestAttribute = a, cut = h
    partition C based on the cut of the attribute BestAttribute
    let the resultant sub-clusters be C1 and C2
    call gM-DISC(C1) and gM-DISC(C2)

Page 45: Data Mining Techniques for Query Relaxation

45

MTAH of RECTANGLES (Height, Width)

Page 46: Data Mining Techniques for Query Relaxation

46

The Database Table AIRCRAFT

How do we find similar aircraft?

Page 47: Data Mining Techniques for Query Relaxation

47

MTAH for AIRCRAFT

Page 48: Data Mining Techniques for Query Relaxation

48

Example for Numerical Attribute Value

Motor data from PartNet (http://PartNet)

Page 49: Data Mining Techniques for Query Relaxation

49

TAH for Motor Capability

Page 50: Data Mining Techniques for Query Relaxation

50

TAH for Motor Size and Weight

Page 51: Data Mining Techniques for Query Relaxation

51

TAHs for Motor

The MOTOR table was adapted from the Housed Torque data from PartNet. After inputting the data, two TAHs were generated automatically by the DISC algorithm.

One TAH was based on peak torque, peak torque power, and motor constant. The other was based on outer diameter, length, and weight. The leaf nodes represent part numbers; the intermediate nodes are classes. The relaxation error (average pairwise distance between the parts) of each node is also given.

Page 52: Data Mining Techniques for Query Relaxation

52

Application of TAHs

The TAHs can be used jointly to satisfy attributes in both TAHs. For example: find a part similar to “T-0716” in terms of peak torque, peak torque power, motor constant, outer diameter, length, and weight. By examining both TAHs, we find that QT-0701 is similar to T-0716, with an expected relaxation error of (0.06 + 0.1)/2 = 0.08.

Page 53: Data Mining Techniques for Query Relaxation

53

Performance of TAH

Performance measures:

accuracy = (retrieved relevant answers) / (all relevant answers)

efficiency = (retrieved relevant answers) / (all retrieved answers)

where “all relevant answers” are the best n answers determined by exhaustive search.

Compare an MTAH with a traditional 2-d index tree (based on the frequency distribution).
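In code, the two measures are simple set-overlap ratios (a minimal sketch; the names are illustrative):

```python
def accuracy(retrieved, relevant):
    """accuracy = retrieved relevant answers / all relevant answers (recall-like)."""
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def efficiency(retrieved, relevant):
    """efficiency = retrieved relevant answers / all retrieved answers (precision-like)."""
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

# Exhaustive search retrieves everything: accuracy = 1.0, efficiency is tiny.
everything, best_n = range(972), range(10)
print(accuracy(everything, best_n), round(efficiency(everything, best_n), 3))  # 1.0 0.01
```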

Page 54: Data Mining Techniques for Query Relaxation

54

Performance of MTAHs

Based on the attributes Longitude and Latitude of 972 geographical locations from a transportation database.

500 queries of the form “find the n locations nearest to (long, lat)”, where n is randomly selected from 1 to 20 and long and lat are generated based on the distribution of the geographical locations.

             MTAH   GMTAH   ME-Tree   E-S
efficiency   0.54   0.53    0.64      0.011
accuracy     0.85   0.84    0.68      1.0
error        1.14   1.17    1.57      1.0

MTAH is more accurate than the 2-d tree (ME-Tree).

MTAH is far more efficient than exhaustive search (E-S).

Page 55: Data Mining Techniques for Query Relaxation

55

Generation of Evolutionary TAH

Approximate query answering for temporal data (given as a set of time sequences): find time sequences that are similar to a given template sequence.

A time sequence S of n stages is defined as an n-tuple S = (s1, …, sn), where each si is a numerical value.

Issues:
• Needs a similarity measure for sequences
• Use clustering for efficient retrieval
• Evaluation of the work

Page 56: Data Mining Techniques for Query Relaxation

56

Automatic Construction of TAHs

Necessary for scaling up CoBase.

Sources of knowledge:
• Database instance
  - Attribute value distributions
  - Inter-attribute relationships
• Query and answer statistics
• Domain expert

Approach:
• Generate an initial TAH with minimal expert effort
• Edit the hierarchy to suit
  - Application context
  - User profile

Page 57: Data Mining Techniques for Query Relaxation

57

The CoBase Knowledge-Base Editor

Tool for Type Abstraction Hierarchies:
• Display available TAHs
• Visualize TAHs as graphs
• Edit TAHs
  - Add/delete/move nodes and sub-trees
  - Assign names to nodes
• Interface to knowledge discovery tools

Cooperative Operators
• Specify parameter values
• Approximate, Near-To, Similar-To

Page 58: Data Mining Techniques for Query Relaxation

58

An Example of Using the KB Editor
