1
Pushing Aggregate Constraints by Divide-and-Approximate
Ke Wang, Yuelong Jiang, Jeffrey Xu Yu,
Guozhu Dong and Jiawei Han
2
Constraints That Are Not Easy to Push
There exists a gap between the interestingness criterion and the techniques used in mining patterns from a large amount of data.
Anti-monotonicity is too loose as a pruning strategy.
Anti-monotonicity is too restrictive as an interestingness criterion.
Should we design new algorithms to mine those patterns that can only be found using anti-monotonicity?
Mining patterns with “general” constraints.
3
Iceberg-Cube Mining
An iceberg-cube mining query:
select A, B, C, count(*) from R cube by A, B, C having count(*) >= 2
count(*) >= 2 is an anti-monotone constraint.
Relation R:
A  B  C  M
1  2  3  100
1  2  4  200
2  2  4  300
Cells with count(*) >= 2 (- means the dimension is not grouped on):
A  B  C  count(*)
1  -  -  2
-  2  -  3
-  -  4  2
1  2  -  2
-  2  4  2
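A brute-force sketch of the count query above, written in Python rather than SQL. It enumerates every non-empty group-by of each tuple of R and keeps the cells whose count reaches the threshold. The name iceberg_count and the choice to skip the empty (grand-total) group-by are assumptions made for illustration; this is not the authors' algorithm.

```python
from collections import Counter
from itertools import combinations

DIMS = ("A", "B", "C")
# Tuples of R as (A, B, C, M), copied from the slide.
R = [(1, 2, 3, 100), (1, 2, 4, 200), (2, 2, 4, 300)]


def iceberg_count(rows, min_count):
    """Return {cell: count} for every non-empty group-by cell with count >= min_count.

    A cell is a tuple of (dimension, value) pairs.
    """
    counts = Counter()
    for row in rows:
        for k in range(1, len(DIMS) + 1):
            for idx in combinations(range(len(DIMS)), k):
                cell = tuple((DIMS[i], row[i]) for i in idx)
                counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}


result = iceberg_count(R, 2)
for cell, n in sorted(result.items()):
    print(cell, n)
```

Because count(*) >= 2 is anti-monotone, a real implementation would stop expanding a cell as soon as its count drops below the threshold instead of enumerating everything first.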
4
Iceberg-Cube Mining
Another query:
select A, B, C, sum(M) from R cube by A, B, C having sum(M) >= 150
sum(M) >= 150 is an anti-monotone constraint when all values in M are non-negative.
sum(M) >= 150 is not an anti-monotone constraint when some values in M are negative.
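Two example relations (labelled R1 and R2 on the slide).
All values in M positive:
A  B  C  M
1  2  3  100
1  3  4  20
2  3  4  130
Some values in M negative:
A  B  C  M
1  2  3  -100
1  3  4  200
2  3  4  -300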
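The claim about negative values can be checked mechanically on the two relations above. The sketch below lists every pair (cell, super-cell) in which the cell fails sum(M) >= 150 while its super-cell satisfies it; an anti-monotone constraint admits no such pair. The names R_POS, R_NEG, cube_sums and anti_monotone_violations are hypothetical, chosen only for this illustration.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
R_POS = [(1, 2, 3, 100), (1, 3, 4, 20), (2, 3, 4, 130)]     # all M positive
R_NEG = [(1, 2, 3, -100), (1, 3, 4, 200), (2, 3, 4, -300)]  # some M negative


def cube_sums(rows):
    """Map every non-empty cell (a frozenset of (dimension, value) pairs) to sum(M)."""
    sums = {}
    for row in rows:
        for k in range(1, len(DIMS) + 1):
            for idx in combinations(range(len(DIMS)), k):
                cell = frozenset((DIMS[i], row[i]) for i in idx)
                sums[cell] = sums.get(cell, 0) + row[-1]
    return sums


def anti_monotone_violations(rows, sigma):
    """Pairs (cell, super_cell) where cell fails sum >= sigma but super_cell satisfies it."""
    sums = cube_sums(rows)
    return [(c, s) for c in sums for s in sums
            if c < s and sums[c] < sigma <= sums[s]]


print(anti_monotone_violations(R_POS, 150))       # no violating pair
print(len(anti_monotone_violations(R_NEG, 150)))  # at least one violating pair
```

On the relation with negative values, the cell A = 1 has sum 100 and fails, yet its super-cell (A = 1, B = 3) has sum 200 and passes, so pruning at A = 1 would lose an answer.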
5
The Main Idea
Study iceberg-cube mining.
Consider constraints of the form f(v) θ σ, where f is a function built from SQL-like aggregates and arithmetic operators (+, -, *, /), v is a variable, σ is a constant, and θ is either ≤ or ≥.
Can we push constraints that are neither anti-monotone nor monotone into iceberg-cube mining? If so, what is a pushing method that is not specific to a particular constraint?
Divide-and-Approximate: find a “stronger approximator” for the constraint in a subspace.
6
Some Definitions
A relation has many dimensions Di and one or more measures Mi.
A cell is a combination of values di…dk from dimensions Di, …, Dk. Use c as a cell variable and di…dk as a (representative) cell value.
SAT(d1…dk) (or SAT(c)) contains all tuples that contain all the values in d1…dk (or c).
c’ is a super-cell of c, or c is a sub-cell of c’, if c’ contains all the values in c.
Let C be a constraint (f(v) θ σ). CUBE(C) denotes the set of cells that satisfy C.
A constraint C is weaker than C’ if CUBE(C’) ⊆ CUBE(C).
A B C M
1 2 3 -100
1 3 4 200
2 3 4 -300
7
An Example
Iceberg-cube mining query:
select A, B, C, sum(M) from R cube by A, B, C having sum(M) >= 150
sum(c) >= 150 is neither anti-monotone nor monotone.
Let the space be S = {ABC, AB, AC, BC, A, B, C}.
Let sum(c) = psum(c) – nsum(c) >= 150, where psum(c) is the profit and nsum(c) is the cost.
Push an anti-monotone approximator:
Use psum(c) >= 150 and ignore nsum(c). If nsum(c) is large, there are many false positives.
Use a min nsum in S: psum(c) – nsummin(ABC) >= 150, where nsummin(ABC) is the minimum nsum in S.
Use a min nsum in a subspace of S (a stronger constraint).
A B C M
1 2 3 -100
1 3 4 200
2 3 4 -300
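A minimal sketch of the approximators listed above, run on this three-tuple relation. It assumes psum(c) is the total of the positive M values in a cell and nsum(c) is the absolute total of the negative ones, which matches the arithmetic used elsewhere in the slides. The helper names and the particular subspace (cells that fix C = 4 and do not group on A) are hypothetical choices for illustration only.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
R = [(1, 2, 3, -100), (1, 3, 4, 200), (2, 3, 4, -300)]
SIGMA = 150


def cube(rows):
    """Map each non-empty cell to (psum, nsum)."""
    agg = {}
    for *dims, m in rows:
        for k in range(1, len(DIMS) + 1):
            for idx in combinations(range(len(DIMS)), k):
                cell = frozenset((DIMS[i], dims[i]) for i in idx)
                p, n = agg.get(cell, (0, 0))
                agg[cell] = (p + max(m, 0), n + max(-m, 0))
    return agg


def exact(agg, space):
    """Cells of `space` that satisfy psum(c) - nsum(c) >= SIGMA."""
    return {c for c in space if agg[c][0] - agg[c][1] >= SIGMA}


def approx(agg, space):
    """Cells of `space` that satisfy psum(c) - (min nsum in `space`) >= SIGMA."""
    nsum_min = min(agg[c][1] for c in space)
    return {c for c in space if agg[c][0] - nsum_min >= SIGMA}


agg = cube(R)
S = set(agg)  # the whole space
# Hypothetical subspace: cells that fix C = 4 and do not group on A.
sub = {c for c in S if ("C", 4) in c and all(d != "A" for d, _ in c)}

print(len(exact(agg, S)), len(approx(agg, S)))  # true answers vs. candidates kept
print(approx(agg, sub))                         # candidates kept inside the subspace
```

Over the whole space the minimum nsum is 0, so the approximator degenerates to psum(c) >= 150 and keeps false positives such as the cell C = 4. Inside the subspace the minimum nsum is 300, and the same two cells are rejected, which is what "stronger" means here.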
8
The Search Strategy (using a lexicographic tree)
A node represents a group-by.
BUC (BottomUpCube): partition the database in the depth-first order of the lexicographic tree.
[Figure: lexicographic tree of group-bys over dimensions A, B, C, D, E, from the root 0 down to ABCDE]
A B C D E M
1 2 3 4 5 100
1 2 3 4 5 100
1 2 3 4 5 -150
1 2 3 4 5 -100
1 2 3 4 6 50
1 2 3 5 6 40
1 2 4 5 6 400
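The depth-first partitioning can be sketched as a short recursive function over the table above. This is a simplified, BUC-style illustration under stated assumptions: the function name buc, its parameters and the use of a plain count as the minimum-support test are hypothetical, and none of the paper's pruning is included.

```python
DIMS = ("A", "B", "C", "D", "E")
ROWS = [
    (1, 2, 3, 4, 5, 100),
    (1, 2, 3, 4, 5, 100),
    (1, 2, 3, 4, 5, -150),
    (1, 2, 3, 4, 5, -100),
    (1, 2, 3, 4, 6, 50),
    (1, 2, 3, 5, 6, 40),
    (1, 2, 4, 5, 6, 400),
]


def buc(rows, dims, start=0, cell=(), min_count=1, out=None):
    """Visit cells in depth-first lexicographic order and record sum(M) per cell.

    `cell` is a tuple of (dimension, value) pairs fixed so far; only
    dimensions at index >= start are used to extend it.
    """
    if out is None:
        out = {}
    for d in range(start, len(dims)):
        partitions = {}
        for row in rows:
            partitions.setdefault(row[d], []).append(row)
        for value, part in sorted(partitions.items()):
            if len(part) >= min_count:  # minimum support is anti-monotone
                new_cell = cell + ((dims[d], value),)
                out[new_cell] = sum(r[-1] for r in part)
                buc(part, dims, d + 1, new_cell, min_count, out)
    return out


cube = buc(ROWS, DIMS)
visited = []
for c in cube:
    name = "".join(d for d, _ in c)
    if name not in visited:
        visited.append(name)
print(visited[:6])  # group-bys in order of first visit
```

Each recursive call only sees the tuples of its own partition, so a partition that falls below the minimum support is never expanded further.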
9
Another Example
Iceberg-cube mining query:
select A, B, C, D, E, sum(M) from R cube by A, B, C, D, E having sum(M) >= 200
At node ABCDE, sum(12345) = psum(12345) – nsum(12345) = 200 – 250 = -50 (fails).
Backtracking to ABC, psum(123) – nsummin(12345) = 290 – 100 = 190 (fails).
Then, at node ABCE, the cell p[ABCE] = 1235 must fail. Therefore, all tuples t with t[ABCE] = 1235 can be pruned.
A B C D E M
1 2 3 4 5 100
1 2 3 4 5 100
1 2 3 4 5 -150
1 2 3 4 5 -100
1 2 3 4 6 50
1 2 3 5 6 40
1 2 4 5 6 400
10
Find a cell p at u0 that fails C, and then extract an anti-monotone approximator Cp.
Consider an ancestor uk of u0, where u0 is the left-most leaf in tree(uk).
p[u] denotes p projected onto u (a cell of u).
tree(uk, p) = {p[u] | u is a node in tree(uk)}.
p is the max cell in tree(uk, p) and p[uk] is the min cell in tree(uk, p).
If p[uk] fails Cp, all cells in tree(uk, p) fail.
Note: tree(uk, p) ≠ tree(uk, p’) if p’ ≠ p.
[Figure: lexicographic tree over A, B, C, D, E with the pairs (u0, uk) and (u0’, uk’) marked and the subtree Tree(uk) highlighted]
A node in tree(uk) is a set of group-by attributes.
A cell in tree(uk, p) is a set of group-by values.
11
The Pruning
On the backtracking from u0 to uk:
Check if u0 is on the left-most path in tree(uk).
Check if p[uk] can use the same anti-monotone approximator as p[u0].
Check if p[uk] fails Cp.
If all conditions are met, then for every unexplored child ui of uk, prune all the tuples that match p on tail(ui), because such tuples generate only cells in tree(uk, p), which fail Cp.
tail(u): the set of all dimensions appearing in tree(u).
[Figure: lexicographic tree over A, B, C, D, E with u0, uk and Tree(uk) marked]
12
Suppose that a cell p[ABCDE] fails.
On the backtracking from ABCDE to ABC, if the conditions are met (p[ABC] fails):
Prune tuples t such that t[ABCE] = p[ABCE].
On the backtracking from ABC to AB, if the conditions are met (p[AB] fails):
Prune tuples t such that t[ABDE] = p[ABDE] from tree(ABD).
Prune tuples t such that t[ABE] = p[ABE] from tree(ABE).
[Figure: lexicographic tree over A, B, C, D, E with nodes marked u0, uk, ui, ui’ and uk’]
Given a leaf node u0 and a cell p at u0:
Let uk…u0 be the leftmost path in tree(uk), k >= 0.
p is a pruning anchor wrt (uk, u0).
tree(uk, p) is the pruning scope.
13
The D&A Algorithm
Modify BUC.
Push up a pruning anchor p along the leftmost path from u0 to uk.
Partition the pruning anchors pushed up to the current node, in addition to partitioning the tuples.
A B C D E M
1 2 3 4 5 100
1 2 3 4 5 100
1 2 3 4 5 -150
1 2 3 4 5 -100
1 2 3 4 6 50
1 2 3 5 6 40
1 2 4 5 6 400
14
With Min-Support
Suppose cell abcd is frequent, but cell abcde is infrequent (we should stop at abcd).
If cell abcd is anchored at node A, we cannot prune ae, abe, ace, ade in tree(A, abcd).
[Figure: lexicographic tree of group-bys over dimensions A, B, C, D, E, from the root 0 down to ABCDE]
A B C D E M
1 2 3 4 5 100
1 2 3 4 5 10
1 2 3 4 8 -50
1 2 3 4 8 -100
1 2 3 5 6 -50
1 2 3 5 6 40
1 2 3 5 7 10
Min-sup = 3; sum(M) >= 100
15
Rollback tree
RBtree(AD), RBtree(AC), RBtree(ABD), RBtree(D), RBtree(C), and RBtree(B) do not have E.
If abcd is anchored at the root, we can prune tuples from RBtree(D), RBtree(C), and RBtree(B).
[Figure: rollback tree over dimensions A, B, C, D, E; the children of the root appear in the order A, E, D, C, B]
A B C D E M
1 2 3 4 5 100
1 2 3 4 5 10
1 2 3 4 8 -50
1 2 3 4 8 -100
1 2 3 5 6 -50
1 2 3 5 6 40
1 2 3 5 7 10
Min-sup = 3; sum(M) >= 100
16
Constraint/Function Monotonicity
A constraint C is a-monotone if, whenever a cell is not in CUBE(C), neither is any of its super-cells.
A constraint C is m-monotone if, whenever a cell is in CUBE(C), so is every one of its super-cells.
A function x(y) is a-monotone wrt y if x decreases as y grows (for cell-valued y) or as y increases (for real-valued y).
A function x(y) is m-monotone wrt y if x increases as y grows (for cell-valued y) or as y increases (for real-valued y).
An example: sum(v) = psum(v) – nsum(v).
sum(v) is m-monotone wrt psum(v).
sum(v) is a-monotone wrt nsum(v).
17
Constraint/Function Monotonicity
Let ā denote m, and m̄ denote a. Let τ denote either a or m.
Example: if psum(v) ≥ σ is a-monotone, then psum(v) ≤ σ is m-monotone.
If psum(c1) ≥ σ does not hold, then psum(c2) ≥ σ does not hold either, where c2 is a super-cell of c1 (say c1 is a cell of ABC and c2 is a cell of ABCD).
f(v) ≥ σ is τ-monotone if and only if f(v) is τ-monotone wrt v.
f(v) ≤ σ is τ̄-monotone if and only if f(v) is τ-monotone wrt v.
An example: sum(v) = psum(v) – nsum(v) ≥ σ.
sum(v) ≥ σ is m-monotone wrt psum(v), because sum(v) is m-monotone wrt psum(v).
sum(v) ≥ σ is a-monotone wrt nsum(v), because sum(v) is a-monotone wrt nsum(v).
18
Find Approximators
Consider f(v) ≥ σ. Divide the aggregates in f into two groups:
A+: as cell v grows (becomes a super-cell), f monotonically increases.
A-: as cell v grows (becomes a super-cell), f monotonically decreases.
Consider sum(v) = psum(v) – nsum(v) ≥ σ: A+ = {nsum(v)}, A- = {psum(v)}.
f(A+; A-/cmin) ≥ σ and f(A+/cmin; A-) ≤ σ are m-monotone approximators in a subspace Si, where cmin is a min cell instantiation in Si.
f(A+/cmax; A-) ≥ σ and f(A+; A-/cmax) ≤ σ are a-monotone approximators in a subspace Si, where cmax is a max cell instantiation in Si.
Example: sum(nsum/cmax; psum) ≥ σ.
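A small check of the a-monotone approximator for sum, using the seven-tuple relation with dimensions A to E from the earlier slides and the threshold sum(M) >= 200. It reads "nsum/cmax" as "nsum evaluated at the max cell cmax", which is an assumption about the slide's notation, and it takes the subspace to be tree(ABC, p) for the failing cell p = 12345. The helper names are hypothetical, and the printed values are whatever this reading yields, not figures quoted from the slides.

```python
DIMS = "ABCDE"
ROWS = [
    (1, 2, 3, 4, 5, 100),
    (1, 2, 3, 4, 5, 100),
    (1, 2, 3, 4, 5, -150),
    (1, 2, 3, 4, 5, -100),
    (1, 2, 3, 4, 6, 50),
    (1, 2, 3, 5, 6, 40),
    (1, 2, 4, 5, 6, 400),
]
SIGMA = 200


def psum_nsum(cell):
    """cell maps dimension name -> value; return (psum, nsum) over matching tuples."""
    ms = [r[-1] for r in ROWS
          if all(r[DIMS.index(d)] == v for d, v in cell.items())]
    return sum(m for m in ms if m > 0), -sum(m for m in ms if m < 0)


p = dict(zip(DIMS, (1, 2, 3, 4, 5)))  # failing cell at leaf ABCDE, used as cmax
# tree(ABC, p): p projected onto every node of tree(ABC).
subspace = [{d: p[d] for d in node} for node in ("ABC", "ABCD", "ABCE", "ABCDE")]
nsum_at_cmax = psum_nsum(p)[1]


def exact(cell):
    ps, ns = psum_nsum(cell)
    return ps - ns >= SIGMA


def approx(cell):
    """psum(cell) - nsum(cmax) >= SIGMA: only psum varies, so this is anti-monotone."""
    return psum_nsum(cell)[0] - nsum_at_cmax >= SIGMA


print(approx(subspace[0]))               # does the min cell p[ABC] pass the approximator?
print([exact(c) for c in subspace])      # exact constraint on each cell of tree(ABC, p)
```

Since nsum can only shrink as a cell grows, nsum at cmax is the smallest nsum in the subspace, so the approximator never rejects a cell that satisfies the exact constraint. If the min cell p[ABC] fails it, every cell in tree(ABC, p) fails the exact constraint as well.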
19
Separate Monotonicity
Consider function rewriting: (E1 + E2) * E into E1 * E + E2 * E.
Consider space division: divide a space into subspaces Si.
Find approximators using equation-rewriting techniques for each subspace Si.
20
Experimental Studies
Consider sum(v) = psum(v) – nsum(v).
Three algorithms:
BUC: push only the minimum support.
BUC+: push approximators and minimum support.
D&A: push approximators and minimum support.
21
Vary minimum support
[Figure: running time in seconds (axis from 200 to 500) against minimum support (%) at 0.02, 0.05, 0.1, 0.2 and 0.5, for BUC, BUC+ and D&A]
22
Without minimum support
[Figure: running time in seconds (axis from 200 to 400) against sigma at 50, 100, 150, 200 and 250, for BUC+ and D&A. Note on the chart: *) psum(v) >= sigma]
23
Scalability
[Figure: running time in seconds (axis from 0 to 2500) against the number of dimensions, 15 to 21, for BUC, BUC+ and D&A]
[Figure: running time in seconds (axis from 0 to 2000) against the number of tuples in thousands, 200 to 1000, for BUC, BUC+ and D&A]
24
Conclusion
General aggregate constraints, rather than only well-behaved constraints.
SQL-like tuple-based aggregates, rather than item-based aggregates.
Constraint-independent techniques, rather than constraint-specific techniques.
A new push strategy: divide-and-approximate