User-Defined Aggregates for Advanced Database Applications
Haixun [email protected]
Computer Science Dept., University of California, Los Angeles
2
State of the Art: the big picture
Relational Databases:
  Single data model of spartan simplicity
  Logic-based database languages
  Commercial success and dominance through SQL standards (in the 80’s)
A new wave of DB applications (in the 90’s)
  Knowledge discovery, Data Mining, OLAP, Time Series analysis, Spatial/Temporal DBs, XML, …
  Underscores limitations of RDBMS
  Prompted major extensions leading to SQL3 standards (SQL99 is a subset of SQL3)
3
State of the Art: R&D highlights
Deductive DBs
  Rule based syntax
  Logic based formalization of query language semantics – e.g., nonmonotonic reasoning and stratification
  Recursive queries
OO-DBs
  Complex data types / inheritance
  Expressive power by merging PL and query languages
OR-DBs
  Rich data types / Path Expression (SQL)
  UDFs and Database Extenders (Data Blades)
4
State of the Art: the seamy side
A patchwork of major extensions
  DBMSs have become more powerful but much harder and more complex to build and use
Still not powerful enough
  Data Mining: clustering, classification, association
  New language constructs not helping either
Limited expressive power in other applications
  Bill of Materials (BoM) type of applications
  Temporal reasoning
5
This thesis: Many of the problems can be solved by UDAs
User Defined Aggregates (UDAs): insufficient support in the commercial world and in DB standards
Our claim: UDAs provide a more general and powerful mechanism for DB extensions
AXL – a system to make it easier to define UDAs
AXL – where SQL and Data Mining intersect
6
A Brief History of AXL (and of my thesis)
Logic formalization of aggregates [DDLP’98, LBAI’00]
  Early returns, monotonic aggregates: used freely in recursive queries
  Extensions of LDL++ (Logic Database Language)
SQL-AG: Implementing and extending SQL3 UDAs [DBPL’99]
  Implemented on DB2 using extended user-defined aggregates with ‘early returns’
SADL: Simple Aggregate Definition Language [ICDE’00]
  Using SQL to define new aggregates
  Easy to use, but with limited performance and power
AXL: Aggregate eXtension Language [VLDB’00]
  Powerful, efficient and still SQL-based
7
Defining UDAs in SQL3
AGGREGATE FUNCTION myavg(val NUMBER)
  RETURN NUMBER
  STATE state
  INITIALIZE myavg_init
  ITERATE myavg_iterate
  TERMINATE myavg_return

INITIALIZE: gives an initial value to the aggregate
ITERATE: computes the intermediate aggregate value for each new record
TERMINATE: returns the final value computed for the aggregate
myavg_init, myavg_iterate, myavg_return are 3 functions that the user must write in a procedural programming language
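The three-routine shape can be tried out in any system that lets a procedural language supply the aggregate. The sketch below uses Python's standard sqlite3 module rather than SQL3 syntax, so it is only an analogy: the constructor plays the role of INITIALIZE, step of ITERATE, and finalize of TERMINATE. The names MyAvg, demo_myavg and the employee table are illustrative assumptions, not part of the slides.

```python
import sqlite3


class MyAvg:
    """Procedural aggregate: one object holds the state for one group."""

    def __init__(self):
        # INITIALIZE: give the aggregate state its initial value.
        self.total = 0.0
        self.count = 0

    def step(self, val):
        # ITERATE: fold one new record into the intermediate state.
        if val is not None:
            self.total += val
            self.count += 1

    def finalize(self):
        # TERMINATE: return the final value computed from the state.
        return self.total / self.count if self.count else None


def demo_myavg():
    """Register the aggregate and call it from an ordinary SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("myavg", 1, MyAvg)
    conn.execute("CREATE TABLE employee(salary REAL)")
    conn.executemany("INSERT INTO employee VALUES (?)",
                     [(10.0,), (20.0,), (60.0,)])
    (result,) = conn.execute("SELECT myavg(salary) FROM employee").fetchone()
    conn.close()
    return result
```

Calling demo_myavg() runs the aggregate over three salaries. Note that the value is only available after finalize, which is the "no early returns" limitation the next slide discusses.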
8
Limitation of SQL3 UDAs
UDAs in SQL3, Postgres, Informix, and early versions of LDL++ share the same limitations:
  Aggregates cannot be used inside recursion
  No support for early returns and on-line aggregation
Also, Ease of Use is a major issue (except for LDL++)
9
Ease of Use
THE PROBLEM: UDFs are very hard to write and debug. In “unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem.
A SOLUTION: Use a high-level language for defining UDAs. But there are many potential problems with any new language.
THE IDEAL SOLUTION: use SQL to define new aggregates. Substantial benefits:
  Users are already familiar with SQL
  No impedance mismatch of data types and programming paradigms
  DB advantages: scalability, data independence, optimizability, parallelizability
10
Simple Aggregates
AGGREGATE avg(value INT) : REAL
{ TABLE state(sum INT, cnt INT);
  INITIALIZE: {
    INSERT INTO state VALUES (value, 1);
  }
  ITERATE: {
    UPDATE state SET sum=sum+value, cnt=cnt+1;
  }
  TERMINATE: {
    INSERT INTO RETURN SELECT sum/cnt FROM state;
  }
}
11
Avoiding Multiple Scans
Show the average salary of senior managers who make more than 3 times the average employee salary.
SQL:
  SELECT avg(salary)
  FROM employee
  WHERE title = 'senior manager'
    AND salary > 3 * (SELECT avg(salary) FROM employee)
Two scans of the employee table required
With AXL UDAs:
  SELECT sscan(title, salary)
  FROM employee
12
AXL: Using a Single Scan
AGGREGATE sscan(title CHAR(20), salary INT) : REAL
{ TABLE state(sum INT, cnt INT) AS VALUES (0,0);
  TABLE seniors(salary INT);
  INITIALIZE: ITERATE: {
    UPDATE state SET sum=sum+salary, cnt=cnt+1;
    INSERT INTO seniors VALUES(salary) WHERE title = 'senior manager';
  }
  TERMINATE: {
    INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s
      WHERE s.salary > 3 * (SELECT sum/cnt FROM state);
  }
}
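To make the single-scan idea concrete, here is a minimal Python sketch of the same computation. It is not AXL and says nothing about how AXL executes the aggregate; the function name sscan is reused for readability, and the parameters title and ratio are illustrative assumptions. The running sum and count stand in for the state table, and the list stands in for the seniors table.

```python
def sscan(rows, title="senior manager", ratio=3):
    """One pass over (title, salary) rows, then a small final step."""
    total = 0          # state.sum
    count = 0          # state.cnt
    seniors = []       # local table of candidate salaries

    # ITERATE: the only pass over the input.
    for row_title, salary in rows:
        total += salary
        count += 1
        if row_title == title:
            seniors.append(salary)

    # TERMINATE: works only on the (usually much smaller) seniors list.
    if count == 0:
        return None
    threshold = ratio * (total / count)
    qualifying = [s for s in seniors if s > threshold]
    return sum(qualifying) / len(qualifying) if qualifying else None
```

The input is read once; the second "scan" touches only the rows saved in seniors, which is the saving the slide points at.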
13
Ordered Sequences and Time Series
We have a sequence of events, each of which is active during a certain interval (from, end). Find out at which point of time we have the largest number of active events.
SQL: Group-by on the start time of each interval, and count!
With AXL UDAs: SELECT density(from, end)
FROM events
14
AXL: Using a Single Scan
AGGREGATE density(start TIME, end TIME) : (time TIME, count INT)
{ TABLE state(time TIME, count INT) AS VALUES (0, 0);
  TABLE active(endpoint TIME);
  INITIALIZE: ITERATE: {
    DELETE FROM active WHERE endpoint < start;
    INSERT INTO active VALUES(end);
    UPDATE state SET time=start, count=count+1
      WHERE count < (SELECT count(*) FROM active);
  }
  TERMINATE: {
    INSERT INTO RETURN SELECT time, count FROM state;
  }
}
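The density aggregate can be traced by hand with a short Python sketch. It mirrors the three statements of the ITERATE block and assumes, as the aggregate appears to, that events arrive ordered by start time; that ordering requirement is an assumption read off the code, not something the slide states.

```python
def density(events):
    """Return (time, count) for the busiest instant.

    events: iterable of (start, end) pairs, assumed sorted by start.
    """
    best_time, best_count = None, 0   # the state table
    active = []                       # end points of currently active events

    for start, end in events:
        # DELETE FROM active WHERE endpoint < start
        active = [e for e in active if e >= start]
        # INSERT INTO active VALUES(end)
        active.append(end)
        # UPDATE state ... WHERE count < (SELECT count(*) FROM active)
        if best_count < len(active):
            best_time, best_count = start, best_count + 1

    return best_time, best_count
```

Because active grows by at most one row per input event, incrementing the count by one is enough to keep it equal to the largest size seen so far.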
16
Early Returns
AVG normally converges early: an early approximation is all that is needed in several applications
Online aggregation means that early returns are produced during the computation
Many applications: e.g., find the local max and min in a sequence of values, various temporal aggregates, rollups, etc.
These might depend on the order – same as new OLAP extensions in SQL3
17
Return avg for Every 100 Records
AGGREGATE olavg(value INT): REAL
{ TABLE state(sum INT, cnt INT);
  INITIALIZE: {
    INSERT INTO state VALUES (value,1);
  }
  ITERATE: {
    UPDATE state SET sum=sum+value, cnt=cnt+1;
    INSERT INTO RETURN
      SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0;
  }
  TERMINATE: {
    INSERT INTO RETURN SELECT sum/cnt FROM state;
  }
}
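An early-returning aggregate behaves much like a generator: it hands back values while the input is still being consumed. The Python sketch below illustrates olavg under that reading; the parameter every is an illustrative assumption replacing the hard-coded 100.

```python
def olavg(values, every=100):
    """Yield the running average after each `every` records, then the final one."""
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
        if count % every == 0:
            # Early return: INSERT INTO RETURN inside ITERATE.
            yield total / count
    if count:
        # Final return: INSERT INTO RETURN inside TERMINATE.
        yield total / count
```

For the integers 1 to 250, list(olavg(range(1, 251))) gives two early values followed by the final average. As in the AXL definition, an input whose length is an exact multiple of 100 reports its last value twice, once early and once at termination.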
18
Temporal Coalescing
AGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME)
{ TABLE state(cFrom TIME, cTo TIME);
  INITIALIZE: {
    INSERT INTO state VALUES (from, to);
  }
  ITERATE: {
    UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to;
    INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from;
    UPDATE state SET cFrom = from, cTo = to WHERE cTo < from;
  }
  TERMINATE: {
    INSERT INTO RETURN SELECT cFrom, cTo FROM state;
  }
}
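The coalescing logic is easier to follow when written as a generator. This Python sketch follows the three ITERATE statements one by one and assumes the intervals arrive sorted by their from value, which the aggregate seems to rely on; the assumption is mine, not stated on the slide.

```python
def coalesce(intervals):
    """Merge overlapping or adjacent (from, to) intervals, sorted by from."""
    it = iter(intervals)
    try:
        c_from, c_to = next(it)          # INITIALIZE: first interval is the state
    except StopIteration:
        return

    for frm, to in it:                   # ITERATE
        if c_to >= frm and c_to < to:
            c_to = to                    # new interval extends the current one
        if c_to < frm:
            yield (c_from, c_to)         # gap: early-return the finished interval
            c_from, c_to = frm, to       # and restart from the new one

    yield (c_from, c_to)                 # TERMINATE: flush the last interval
```

For example, the intervals (1,3), (2,5), (5,6), (8,9) collapse to (1,6) and (8,9), with (1,6) returned as soon as the gap before 8 is seen.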
19
Recursive Aggregates
In AXL, aggregates can call other aggregates. In particular, an aggregate can call itself recursively.
AGGREGATE alldesc(P CHAR(20)): CHAR(20)
{ INITIALIZE: ITERATE: {
    INSERT INTO RETURN VALUES(P);
    INSERT INTO RETURN
      SELECT alldesc(Child) FROM children WHERE Parent = P;
  }
}
Find all the descendants of Tom:
  SELECT alldesc(Child)
  FROM children
  WHERE Parent = 'Tom';
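A recursive aggregate maps naturally onto a recursive generator. The sketch below is a plain Python analogue of alldesc, assuming the children relation is a list of (parent, child) pairs with no cycles; descendants_of is a hypothetical helper standing in for the top-level query.

```python
def alldesc(children, person):
    """Yield `person`, then everything alldesc yields for each of their children."""
    yield person                                   # INSERT INTO RETURN VALUES(P)
    for parent, child in children:
        if parent == person:                       # WHERE Parent = P
            yield from alldesc(children, child)    # recursive call


def descendants_of(children, ancestor):
    """SELECT alldesc(Child) FROM children WHERE Parent = ancestor."""
    result = []
    for parent, child in children:
        if parent == ancestor:
            result.extend(alldesc(children, child))
    return result
```

With children = [("Tom", "Ann"), ("Tom", "Bob"), ("Ann", "Cid")], descendants_of(children, "Tom") walks Ann's subtree before moving on to Bob.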
20
Check Point
Simple applications: AXL UDAs provide a solution with better performance and good ease of use.
Many advanced database applications
  Time Series, Temporal Database, Spatial Database…
  In particular data mining applications
21
Data Mining and Database Systems
Current Approach:
  Cursor-based: loose-coupling, stored procedures
  UDFs: ease of use problems
  Cache-Mine: Using DBMSs as containers of data
Many attempts to closely integrate data mining functions into DBMS have shown major problems
What we need … SQL-aware Data Mining Systems
Surajit Chaudhuri “Data Mining and Database Systems: Where is the Intersection?” IEEE Data Engineering Bulletin, 1998
25
Decision Tree Classifiers

Training set: tennis

Outlook   Temp  Humidity  Wind    PlayTennis
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  Yes
Overcast  Hot   High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  Yes
Overcast  Cool  Normal    Strong  No
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

Stream of Column/Value Pairs (together with RecId and Category)

RecId  ColId  Value   PlayTennis
1      1      Sunny   No
1      2      Hot     No
1      3      High    No
1      4      Weak    No
2      1      Sunny   Yes
2      2      Hot     Yes
2      3      High    Yes
2      4      Strong  Yes
…
14     1      Rain    No
14     2      Mild    No
14     3      High    No
14     4      Strong  No
26
Convert training set to column/value pairs
AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT)
{ INITIALIZE: ITERATE: {
    INSERT INTO RETURN VALUES (1, v1, yorn), (2, v2, yorn), (3, v3, yorn), (4, v4, yorn);
  }
}

CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT) AS
  SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis)
  FROM tennis;

SELECT classify(recId, col, val, YorN)
FROM col-val-pairs;
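The conversion to column/value pairs can be sketched in a few lines of Python. Here dissemble yields one triple per attribute, and col_val_pairs (a hypothetical name for the view) numbers the records the way mcount() does. Values are kept as strings for readability, whereas the AXL signature uses INT codes.

```python
def dissemble(row):
    """Turn one training record into (col, val, label) triples."""
    *attributes, label = row
    for col, val in enumerate(attributes, start=1):
        yield (col, val, label)


def col_val_pairs(table):
    """Attach a running record id (the role of mcount()) to every triple."""
    for rec_id, row in enumerate(table, start=1):
        for col, val, label in dissemble(row):
            yield (rec_id, col, val, label)
```

Applied to the first two rows of the tennis table, this produces the first eight rows of the column/value stream shown on the previous slide.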
Categorical Classifier in AXL
AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT)
{ TABLE treenodes(RecId INT, Node INT, Col INT, Val REAL, YorN INT);
  TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value});
  TABLE mincol(Col INT, MinGini REAL);
  TABLE ginitable(Col INT, Gini REAL);
  INITIALIZE: ITERATE: {
    INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN);
    UPDATE summary
      SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue;
    INSERT INTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCODE=0;
  }
  TERMINATE: {
    INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc)
      FROM summary GROUP BY Col;
    INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable;
    INSERT INTO result SELECT iNode, Col FROM mincol;
    SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN)
      FROM treenodes AS t,
        (SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m
          WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m
      WHERE t.RecId = m.RecId GROUP BY m.Value;
  }
}
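The split-selection step in TERMINATE can be checked on small inputs with the following Python sketch. It covers only the summary counts and the per-column expression sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc) as written in the classifier; the recursive partitioning is left out. The names gini_by_column and best_column are illustrative, and labels are assumed to be encoded as 1 (yes) and 0 (no).

```python
from collections import defaultdict


def gini_by_column(pairs):
    """pairs: iterable of (col, value, is_yes) with is_yes in {0, 1}."""
    # summary table: (col, value) -> [Yc, Nc]
    summary = defaultdict(lambda: [0, 0])
    for col, value, is_yes in pairs:
        counts = summary[(col, value)]
        counts[0] += is_yes
        counts[1] += 1 - is_yes

    # ginitable: per column, sum((Yc*Nc)/(Yc+Nc)) / sum(Yc+Nc)
    numer = defaultdict(float)
    denom = defaultdict(int)
    for (col, _value), (yc, nc) in summary.items():
        numer[col] += (yc * nc) / (yc + nc)
        denom[col] += yc + nc
    return {col: numer[col] / denom[col] for col in numer}


def best_column(pairs):
    """Pick the column with the smallest index (the role of minpointvalue)."""
    ginis = gini_by_column(pairs)
    return min(ginis, key=ginis.get)
```

A column whose values separate the classes perfectly scores 0, so it is chosen first and, by the MinGini > 0 test in the classifier, needs no further splitting.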
28
Performance
[Figure: two performance charts, “SPRINT Algorithm: AXL vs. C” and “Categorical Classifier: AXL vs. C”]
SPRINT Algorithm in AXL
AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT)
{ TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value));
  TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT);
  TABLE split(Rec INT, LeftOrRight INT, KEY (RecId));
  TABLE mincol(Col INT, Val REAL, Gini REAL);
  TABLE node(Node INT) AS VALUES(iNode);
  INITIALIZE: ITERATE: {
    INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN);
    UPDATE summary
      SET Yc=Yc+iYorN, Nc=Nc+1-iYorN,
          (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal)
      WHERE Col=iCol;
  }
  TERMINATE: {
    INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary;
    INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n;
    INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m
      WHERE t.Col = m.Col AND m.Gini > 0;
    SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN)
      FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec
      GROUP BY s.LeftOrRight;
  }
}
30
Comparison with Other Architectures
Integrating Association Rule Mining with Relational Database Systems
S. Sarawagi et al SIGMOD 98
[Bar chart: time in seconds (axis 0 to 4000) for the Cache, S-Proc, UDF and SQL architectures, broken down by Pass 1 to Pass 4]
Datasets with 6.6 million records, support level .25%.
31
Implementation of AXL
[Diagram: two deployment modes]
Standalone Mode: .axl, .cc, .exe, Berkeley DB
DB2 Add-on Mode: .axl, .cc, UDFs, DB2 .lib, SQL, DB2
32
Implementation of AXL
Open interface of physical data model
  Currently using Berkeley DB as our storage manager
  In memory tables
Limited Optimization
  Using B+-Tree indexes to support equality/range query
  Predicate push-down / push-up
User Defined Aggregates
  Hash based
  Return multiple rows: ‘early return’
  Return multiple columns: employee’s name and salary
33
Implementation of AXL (cont’d)
Non-blocking aggregation
  Keeping the state of aggregation between calls to the aggregate routines
  Local tables defined inside aggregation are passed as parameters to the aggregates
Explicit sorting (and implicit hash-based aggregation)
AXL V1.2: more than 30,000 lines of code
34
Check Point
Simple applications: AXL UDAs provide a solution with better performance and good ease of use.
Data Mining applications
Formal Semantics of Aggregates and Monotonic Aggregation
35
Aggregates in Recursion
Stratification:
  shaves(barber, X) :- villager(X), ¬shaves(X, X).
  villager(barber).
Aggregates:
  p ← count(p) = 0
Aggregates in many applications are actually monotonic (and should be allowed inside recursion).
36
Beyond Stratification Significant previous efforts…
I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The magic of duplicates and aggregates”, VLDB 1990
A. Van Gelder, “Foundations of aggregates in deductive databases”, DOOD 1993
K. A. Ross and Y. Sagiv, “Monotonic aggregates in deductive databases”, JCSS 1997
S. Greco and C. Zaniolo, “Greedy algorithms in Datalog with choice and negation”, JICSLP 1998
37
Formal Semantics of Aggregates
choice((X),(Y))
  Enforcing functional dependency. FD: X -> Y
  Multiplicity of stable models, monotonic transformation
Ordering a domain
  Once (X,Y) is generated, choice ensures this is the only arc leaving source node X and entering sink node Y
Formal semantics of UDA
  …
  return(Y,V) :- ordered(X,Y), ordered(Y,_), terminate(Y,V).
  …
38
Early Returns Monotonic Aggregates
AGGREGATE mcount(): INT
{ TABLE state(cnt INT) AS VALUES (0);
  INITIALIZE: ITERATE: {
    UPDATE state SET cnt=cnt+1;
    INSERT INTO RETURN SELECT cnt FROM state;
  }
}
Aggregates with only ‘early returns’ and no ‘final returns’ are monotonic w.r.t. set containment:
39
Early Returns Monotonic Aggregates
SELECT mcount(*) FROM employee;
vs.
SELECT count(*) FROM employee;

Name (first relation): John, Mary, Tom
Name (second relation): John, Mary, Tom, Jerry

mcount{John, Mary, Tom} = {1,2,3}             count{John, Mary, Tom} = 3
mcount{John, Mary, Tom, Jerry} = {1,2,3,4}    count{John, Mary, Tom, Jerry} = 4
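The monotonicity claim can be checked directly: enlarging the input only adds to the set of values mcount returns, while the single value returned by count is replaced. A minimal Python sketch, with both functions written as set-valued for the comparison (the helper name count_result is an illustrative assumption):

```python
def mcount(rows):
    """Early returns only: emit the running count after every row."""
    results = set()
    cnt = 0
    for _ in rows:
        cnt += 1
        results.add(cnt)
    return results


def count_result(rows):
    """Final return only: a single value, wrapped in a set for comparison."""
    return {len(rows)}
```

For the two Name relations above, mcount of the smaller one is a subset of mcount of the larger one, whereas {3} is not a subset of {4}. That is why count has to be stratified and mcount does not.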
40
Return sum at the nth value
AGGREGATE sumat(value INT, n INT): INT
{
TABLE state (sum INT, cnt INT) AS VALUES (0,0);
INITIALIZE: ITERATE: {
UPDATE state SET sum=sum+value, cnt=cnt +1;
INSERT INTO RETURN
SELECT sum FROM state WHERE cnt = n;
}
}
42
Monotonic Aggregation
Monotonic aggregates can be used without any restriction and without changing the underlying implementation.
This solves the problem that had eluded database researchers since the introduction of relational systems:
  BoM, Company Control, Join-the-Party…
  Greedy optimization algorithms, such as Dijkstra’s single source shortest path.
43
Join-the-Party Problem
Some people will come to the party no matter what, and their names are stored in a sure(PName) relation. But many other people will join only after they know that at least K=3 of their friends will be there.

WITH wllcm(Name) AS
  ((SELECT Pname FROM sure)
   UNION ALL
   (SELECT f.Pname
    FROM friend AS f, wllcm AS w
    WHERE w.Name = f.Fname
    GROUP BY f.Pname
    HAVING mcount()=3))
SELECT Name FROM wllcm;
Density-based Clustering [M. Ester et al. KDD 96]
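The recursive query for Join-the-Party amounts to a fixpoint computation, which the following Python sketch spells out naively. It is an illustration of the semantics, not of how AXL or DB2 evaluates the query; friend is assumed to be a list of (Pname, Fname) pairs, read as "Pname has Fname as a friend", and testing for at least k attending friends is equivalent to the running count reaching k.

```python
def will_come(sure, friend, k=3):
    """Return everyone who ends up attending.

    sure:   iterable of names that attend unconditionally.
    friend: list of (person, friend_of_person) pairs.
    """
    coming = set(sure)
    candidates = {person for person, _ in friend}
    changed = True
    while changed:                       # repeat until nothing new is derived
        changed = False
        for person in candidates - coming:
            attending = {f for p, f in friend if p == person and f in coming}
            if len(attending) >= k:      # the running count has reached k
                coming.add(person)
                changed = True
    return coming
```

Since people are only ever added, never removed, the loop reaches a fixpoint. This is the monotonicity that allows mcount inside the recursion.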
44
BoM: Cost of Parts

[Figure: assembly graph over parts 001 to 006, with edges labelled by quantity]

part-cost
Basic Part  Cost
005         10
006         15

fan-out
Part  ChC
001   4
002   1
003   2
004   3

assembly
Part  Subpart  Qty
001   002      5
001   003      1
001   004      1
001   005      100
002   005      10
003   006      2
003   005      5
004   006      3
004   005      8
004   003      3
45
BoM: Using AXL
WITH cst(part, cost) AS
  ((SELECT part, cost FROM part-cost)
   UNION ALL
   (SELECT a.part, sumat(a.qty * c.cost, p.ChC)
    FROM assembly AS a, cst AS c, fan-out AS p
    WHERE a.subpart = c.part AND p.part = a.part
    GROUP BY a.part))
SELECT part, cost
FROM cst;

Bottom up solution. Computes the cost for each part once and only once.
Monotonic sumat(cost, n) returns sum when exactly n items are aggregated.
Works in DB2 after AXL rewrites calls to sumat() into calls to automatically generated UDFs.
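The bottom-up evaluation can be simulated in Python to see why each part is costed exactly once. The sketch keeps a (sum, count) pair per part, as sumat does, and releases a part's cost only when the count reaches its fan-out. The function name part_costs and the dictionary layout are illustrative assumptions; the data are the part-cost, fan-out and assembly tables from the BoM slide.

```python
def part_costs(part_cost, assembly, fan_out):
    """part_cost: {basic_part: cost}; assembly: [(part, subpart, qty)];
    fan_out: {part: number_of_distinct_subparts}."""
    cost = dict(part_cost)       # the cst relation, seeded with basic parts
    partial = {}                 # part -> [running sum, subparts seen]
    frontier = list(part_cost)   # parts whose cost has just become known

    while frontier:
        sub = frontier.pop()
        for part, subpart, qty in assembly:
            if subpart != sub:
                continue
            state = partial.setdefault(part, [0, 0])
            state[0] += qty * cost[sub]
            state[1] += 1
            if state[1] == fan_out[part]:   # sumat returns at the nth value
                cost[part] = state[0]
                frontier.append(part)
    return cost


PART_COST = {"005": 10, "006": 15}
FAN_OUT = {"001": 4, "002": 1, "003": 2, "004": 3}
ASSEMBLY = [("001", "002", 5), ("001", "003", 1), ("001", "004", 1),
            ("001", "005", 100), ("002", "005", 10), ("003", "006", 2),
            ("003", "005", 5), ("004", "006", 3), ("004", "005", 8),
            ("004", "003", 3)]
```

Running part_costs(PART_COST, ASSEMBLY, FAN_OUT) yields a cost for every part, including intermediate assemblies, in a single bottom-up sweep.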
46
BoM: Using Recursive SQL
WITH mpath(subpart, qty) AS
((SELECT subpart, qty
FROM assembly
WHERE part = ‘001’)
UNION ALL
(SELECT c.subpart, m.qty * c.qty
FROM mpath m, assembly c
WHERE m.subpart = c.part))
SELECT sum(m.qty * c.cost)
FROM mpath m, part_cost c
WHERE m.subpart = c.part ;
Top down solution: computes the cost of part ‘001’
Explosion: all edges that descend from part ‘001’ are “stored” in mpath
What if we want to compute the cost of each part?
47
Check Point
Simple applications
Data Mining and Decision Support
Formal Semantics & Monotonic aggregates
OLAP and other aggregate extensions
  SUCH THAT
  CUBE, ROLLUP, GROUPING SET
  OLAP Functions
48
“Such That”
For each division, show the average salary of senior managers who make more than 3 times the average employee salary, and the average salary of senior engineers who make more than 2 times the average employee salary (in the same output record).
D. Chatziantoniou, Kenneth Ross, VLDB 1996
SELECT division, avg(X.salary), avg(Y.salary)
FROM employee
GROUP BY division: X, Y
SUCH THAT X.title = 'senior manager' AND X.salary > 3*avg(salary)
      AND Y.title = 'senior engineer' AND Y.salary > 2*avg(salary)
49
Expressing “Such That” in AXL
TABLE seniors(salary INT);
AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL
{ TABLE state(sum INT, cnt INT) AS VALUES (0,0);
  INITIALIZE: ITERATE: {
    UPDATE state SET sum=sum+salary, cnt=cnt+1;
    INSERT INTO seniors VALUES (salary) WHERE title = qtitle;
  }
  TERMINATE: {
    INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s
      WHERE s.salary > ratio * (SELECT sum/cnt FROM state);
  }
}
50
Using UDA sscan2
SELECT division,
       sscan2(title, salary, 'senior manager', 3),
       sscan2(title, salary, 'senior engineer', 2)
FROM employee
GROUP BY division;

No joins or sub-queries required.
One pass through the employee relation (standard SQL requires at least 2 passes).
51
Other Aggregate Extensions
GROUPING SETS, ROLL-UP, CUBE
New OLAP extensions
  Windows containing a partitioning, an ordering of rows and an aggregate group
“… every standard must be prepared to tackle new issues that arise as the market evolves. If SQL does not respond positively to this challenge, SQL risks becoming irrelevant …”
-- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle, Introduction to OLAP Functions
56
Contributions
Adding extended UDAs to O-R systems
  High level language, minimal additions to SQL
  Monotonic aggregates, recursive aggregates
  Designed for general purpose applications
Tightly couple data mining functions with DBMS
  SPRINT Algorithm, Categorical Classifier, …
Performance
  More efficient than cursor-based languages like PL/SQL, JDBC and UDF-based approaches …
57
Future Directions
Parallelization
Extenders/Data Blades
  build on top of UDAs instead of UDFs
Decision Support
  the Apriori algorithm
Windows and OLAP functions
Spatial/Temporal extensions
  the TENOR system
58
Future Direction: Parallelization
Current parallel aggregation algorithms valid for AXL:
  Inter-Partition parallelism: all tuples of the same group-by value are in one node
  Two phase algorithm: user provides a COMBINE routine
Unlike SQL3, AXL’s aggregate routines are written in SQL, so we can apply traditional query parallelization techniques to INITIALIZE, ITERATE, and TERMINATE
Since aggregate routines are written in SQL, the COMBINE routine can be generated automatically by the system for simple UDAs
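To illustrate the two-phase algorithm, here is a minimal Python sketch for the avg aggregate. Each partition is reduced independently with INITIALIZE and ITERATE, and a COMBINE routine merges the partial states before TERMINATE runs once. The combine function is hand-written here, so nothing below shows how a system would derive it automatically; all function names are illustrative.

```python
def initialize():
    return (0, 0)                         # (sum, cnt)


def iterate(state, value):
    return (state[0] + value, state[1] + 1)


def combine(left, right):
    # Merging two partial states of avg is just component-wise addition.
    return (left[0] + right[0], left[1] + right[1])


def terminate(state):
    return state[0] / state[1] if state[1] else None


def two_phase_avg(partitions):
    # Phase 1: each partition could be processed on a different node.
    partials = []
    for partition in partitions:
        state = initialize()
        for value in partition:
            state = iterate(state, value)
        partials.append(state)

    # Phase 2: merge the partial states, then finish once.
    merged = initialize()
    for partial in partials:
        merged = combine(merged, partial)
    return terminate(merged)
```

The result matches a sequential run over the concatenated partitions because the (sum, cnt) state is additive. Aggregates with order-dependent early returns would not combine this simply.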