User-Defined Aggregates for Advanced Database Applications Haixun Wang [email protected] Computer Science Dept. University of California, Los Angeles

User-Defined Aggregates for Advanced Database Applications

Haixun [email protected]

Computer Science Dept.University of California, Los Angeles

2

State of the Art: the big picture Relational Databases:

Single data model of spartan simplicity Logic-based database languages Commercial success and dominance through SQL

standards (in the 80’s) A new wave of DB applications (in the 90’s)

Knowledge discovery, Data Mining, OLAP, Time Series analysis, Spatial/Temporal DBs, XML, …

Underscores limitations of RDBMS Prompted major extensions leading to SQL3

standards (SQL99 is a subset of SQL3)

3

State of the Art: R&D highlights Deductive DBs

Rule based syntax Logic based formalization of query language

semantics – e.g., nonmonotonic reasoning and stratification

Recursive queries OO-DBs

Complex data types / inheritance Expressive power by merging PL and query languages

OR-DBs Rich data types / Path Expression (SQL) UDFs and Database Extenders (Data Blades)

4

State of the Art: the seamy side A patchwork of major extensions

DBMSs have become more powerful but much hard and complex to build and use

Still not powerful enough Data Mining: clustering, classification, association New language constructs not helping either

Limited expressive power in other applications Bill of Materials (BoM) type of applications Temporal reasoning

5

This thesisThis thesis: Many of the problems can be solved by UDAs

User Defined Aggregates (UDAs): insufficient support in commercial world and

DB standards Our claim: UDAs provide a more general

and powerful mechanism for DB extensions AXL – a system to make it easier to define

UDAs AXL – where SQL and Data Mining intersect

6

A Brief History of AXL (and of my thesis)

Logic formalization of aggregates [DDLP’98, LBAI’00] Early returns, monotonic aggregates: used freely in recursive queries.

Extensions of LDL++ (Logic Database Language) SQL-AG: Implementing and extending SQL3 UDAs [DBPL’99]

Implemented on DB2 using extended user-defined aggregates with ‘early returns’

SADL: Simple Aggregate Definition Language [ICDE’00] using SQL to define new aggregates easy to use, but with limited performance and power

AXL: Aggregate eXtension Language [VLDB’00] powerful, efficient and still SQL-based

7

Defining UDAs in SQL3AGGREGATE FUNCTION myavg(val NUMBER)RETURN NUMBERSTATE stateINITIALIZE myavg_initITERATE myavg_iterateTERMINATE myavg_return

INITIALIZE: gives an initial value to the aggregate ITERATE : computes the intermediate aggregate value for each

new record TERMINATE: returns the final value computed for the aggregate myavg_init, myavg_iterate, myavg_return are 3 functions that the user

must write in a procedural programming language

8

Limitation of SQL3 UDAs

UDAs in SQL3, Postgres, Informix, and early versions of LDL++ share the same limitations:

Aggregates can not be used inside recursion No support for early returns and on-line

aggregation Also Ease of Use is a major issue (except for LDL+

+)

9

Ease of Use THE PROBLEM: UDFs are very hard to write and debug. In

“unfenced mode” they jeopardize the integrity of the system. UDAs defined using several UDFs are prone to the same problem.

A SOLUTION: Use a high-level language for defining UDAs. But there are many potential problems with any new language.

THE IDEAL SOLUTION: use SQL to define new aggregates. Substantial benefits:

Users are already familiar with SQL No impedance mismatch of data types and programming

paradigms DB advantages: scalability, data independence, optimizability,

parallelizability

10

Simple AggregatesAGGREGATE avg(value INT) : REAL{ TABLE state(sum INT, cnt INT);

INITIALIZE: { INSERT INTO state (value, 1);

} ITERATE: { UPDATE state SET sum=sum+value, cnt=cnt+1; } TERMINATE: { INSERT INTO RETURN SELECT sum/cnt FROM state; }}

11

Avoiding Multiple Scans Show the average salary of senior managers who

make 3 times more than the average employees.

SQL: SELECT avg(salary)

FROM employeeWHERE title = ‘senior manager’ AND salary > 3 * (SELECT avg(salary) FROM employee)

Two scans of the employee table required

With AXL UDAs: SELECT sscan(title, salary)

FROM employee

12

AXL: Using a Single ScanAGGREGATE sscan(title CHAR(20), salary INT) : REAL{ TABLE state(sum INT, cnt INT) AS VALUES (0,0); TABLE seniors(salary INT); INITIALIZE: ITERATE: { UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES(salary) WHERE title = ‘senior manager’; } TERMINATE: { INSERT INTO RETURN SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > 3 * (SELECT sum/cnt FROM state); }}

13

Ordered Sequences and Time Series

We have a sequence of events, each of which is active during a certain interval (from, end). Find out at which point of time we have the largest number of active events.

SQL: Group-by on the start time of each interval, and count!

With AXL UDAs: SELECT density(from, end)

FROM events

14

AXL: Using a Single ScanAGGREGATE density(start TIME, end TIME) : (time TIME, count INT){ TABLE state(time TIME, count INT) AS (0, 0);

TABLE active(endpoint TIME);

INITIALIZE: ITERATE: {DELETE FROM active WHERE endpoint < start;INSERT INTO active VALUES(end);UPDATE state SET time=start, count =count + 1

WHERE count < (SELECT count(*) FROM active);}TERMINATE: {

INSERT INTO RETURN SELECT time, count FROM state;

}}

16

Early Returns AVG normally converges early: an early

approximation is all is needed in several applications

Online aggregation means that early returns are produced during the computation

Many applications: e.g., find the local max and min in a sequence of values, various temporal aggregates, rollups, etc.

These might depend on the order – same as new OLAP extensions in SQL3

17

Return avg for Every 100 Records

AGGREGATE olavg(value INT): REAL { TABLE state(sum INT, cnt INT); INITIALIZE: { INSERT INTO state VALUES (value,1); } ITERATE: {

UPDATE state SET sum=sum+value, cnt=cnt+1; INSERT INTO RETURN

SELECT sum/cnt FROM state WHERE cnt MOD 100 = 0; } TERMINATE: {

INSERT INTO RETURN SELECT sum/cnt FROM state; }}

18

Temporal CoalescingAGGREGATE coalesce(from TIME, to TIME): (start TIME, end TIME){ TABLE state(cFrom TIME, cTo TIME); INITIALIZE: { INSERT INTO state VALUES (from, to); } ITERATE: {

UPDATE state SET cTo = to WHERE cTo >= from AND cTo < to; INSERT INTO RETURN SELECT cFrom, cTo FROM state WHERE cTo < from;

UPDATE state SET cFrom = from, cTo = to WHERE cTo < from; } TERMINATE: {

INSERT INTO RETURN SELECT cFrom, cTo FROM state; }}

19

Recursive Aggregates In AXL, aggregates can call other aggregates. Particularly, an aggregate can call itself recursively.

AGGREGATE alldesc(P CHAR(20)): CHAR(20) { INITIALIZE: ITERATE: {

INSERT INTO RETURN VALUES(P);

INSERT INTO RETURN SELECT alldesc(Child) FROM childrenWHERE Parent = P;

}}

Find all the descendents of Tom:

SELECT alldesc(Child)FROM childrenWHERE Parent = ‘Tom’;

20

Check Point Simple applications: AXL UDAs provide a

solution with better performance and good ease of use.

Many advanced database applications Time Series, Temporal Database, Spatial

Database… In particular data mining

applications

21

Data Mining and Database Systems

Current Approach: Cursor-based: loose-coupling, stored

procedures UDFs: ease of use problems

Cache-Mine: Using DBMSs as containers of data

Many attempts to closely integrate data mining functions into DBMS have shown major problems

What we need …What we need …

SQL-aware Data Mining SQL-aware Data Mining SystemsSystems

Surajit Chaudhuri “Data Mining and Database Systems: Where is the Intersection?” IEEE Data Engineering Bulletin, 1998

25

Decision Tree ClassifiersOutlook Temp Humidity Wind PlayTennisSunny Hot High Weak NoSunny Hot High Strong Yes

Overcast Hot High Weak YesRain Mild High Weak YesRain Cool Normal Weak YesRain Cool Normal Strong Yes

Overcast Cool Normal Strong NoSunny Mild High Weak NoSunny Cool Normal Weak YesRain Mild Normal Weak Yes

Sunny Mild Normal Strong YesOvercast Mild High Strong YesOvercast Hot Normal Weak Yes

Rain Mild High Strong no

RecId ColId Value PlayTennis1 1 Sunny No1 2 Hot No1 3 High No1 4 Weak No2 1 Sunny Yes2 2 Hot Yes2 3 High Yes2 4 Strong Yes

14 1 Rain No14 2 Mild No14 3 High No14 4 Strong No

Training set: tennis Stream of Column/Value Pairs (together with RecId and Category)

26

Convert training set to column/value pairs

AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4 INT, yorn INT) : (col INT, val INT, YorN INT){

INITIALIZE: ITERATE: { INSERT INTO RETURN VALUES(1, v1, yorn), (2,v2,yorn), (3,v3,yorn), (4,v4,yorn); }}

CREATE VIEW col-val-pairs(recId INT, col INT, val INT, YorN INT)SELECT mcount(), dissemble(Outlook, Temp, Humidity, Wind, PlayTennis)FROM tennis;

SELECT classify(recId, col, val, YorN)FROM col-val-pairs;

Categorical Classifier in AXL[ 1] AGGREGATE classify(RecId INT, iNode INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(RecId INT,Node INT, Col INT, Val REAL, YorN INT);[ 3] TABLE summary(Col INT, Value INT, Yc INT, Nc INT, KEY {Col, Value});[ 4] TABLE mincol(Col INT, MinGini REAL);[ 5] TABLE ginitable(Col INT, Gini REAL);[ 6] INITIALIZE: ITERATE: { [ 7] INSERT INTO treenodes VALUES (RecId, iNode, iCol, iValue, iYorN);[ 8] UPDATE summary[ 9] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN WHERE Col = iCol AND Value = iValue; [10] INSERT INTO summary SELECT iCol, iValue, iYorN, 1-iYorN WHERE SQLCDE=0; [11] }[12] TERMINATE: {[13] INSERT INTO ginitable SELECT Col, sum((Yc*Nc)/(Yc+Nc))/sum(Yc+Nc)[14] FROM summary GROUP BY Col;[15] INSERT INTO mincol SELECT minpointvalue(Col, Gini) FROM ginitable; [16] INSERT INTO result SELECT iNode, Col FROM mincol; [17] SELECT classify(t.RecId, t.Node*MAXVALUE+m.Value+1, t.Col, t.Value, t.YorN) [18] FROM treenodes AS t, [19] (SELECT tt.RecId RecId, tt.Value Value FROM treenodes tt, mincol m[20] WHERE tt.Col=m.Col AND m.MinGini>0 ) AS m [21] WHERE t.RecId = m.RecId GROUP BY m.Value;[22] }

[23] }

28

Performance

SPRINT Algorithm: AXL vs. C Categorical Classifier: AXL vs. C

SPRINT Algorithm in AXL[ 1] AGGREGATE sprint(iNode INT, iRec INT, iCol INT, iValue REAL, iYorN INT) [ 2] { TABLE treenodes(Rec INT, Col INT, Val REAL, YorN INT, KEY(Col, Value));[ 3] TABLE summary(Col INT, SplitGini REAL, SplitVal REAL, Yc INT, Nc INT);[ 4] TABLE split(Rec INT, LeftOrRight INT, KEY (RecId));[ 5] TABLE mincol(Col INT, Val REAL, Gini REAL);[ 6] TABLE node(Node INT) AS VALUES(iNode);[ 7] INITIALIZE: ITERATE: { [ 8] INSERT INTO treenodes VALUES (iRec, iCol, iValue, iYorN);[ 9] UPDATE summary[10] SET Yc=Yc+iYorN, Nc=Nc+1-iYorN, (SplitGini, SplitVal) = giniudf(Yc, Nc, N, SplitGini, SplitVal)[11] WHERE Col=iCol;[12] }[13] TERMINATE: {[14] INSERT INTO mincol SELECT minpointvalue(Col, SplitGini, SplitVal) FROM summary;[15] INSERT INTO result SELECT n.Node, m.Col, m.Value FROM mincol AS m, node AS n;[16] INSERT INTO split SELECT t.Rec, (t.Value>m.Value) FROM treenodes AS t, mincol AS m[17] WHERE t.Col = m.Col AND m.Gini > 0;[18] SELECT sprint(n.Node*2+s.LeftOrRight, t.Rec, t.Col, t.Val, t.YorN)[19] FROM treenodes AS t, split AS s, node AS n WHERE t.Rec = s.Rec[20] GROUP BY s.LeftOrRight;[21] }

[22] }

30

Comparison with Other Architectures

Integrating Association Rule Mining with Relational Database Systems

S. Sarawagi et al SIGMOD 98

0

1000

2000

3000

4000

Cache S-Proc UDF SQL

Tim

e in

Sec Pass 4

Pass 3Pass 2Pass 1

Datasets with 6.6 million records, support level .25%.

31

Implementation of AXL

.axl.axl

.cc.cc

.exe.exe

Berkeley DBBerkeley DB

.axl.axl

.cc.cc

UDFsUDFs

DB2 .libDB2 .lib

SQLSQL

DB2DB2

Standalone ModeStandalone Mode DB2 Add-on ModeDB2 Add-on Mode

32

Implementation of AXL Open interface of physical data model.

Currently using Berkeley DB as our storage manager In memory tables

Limited Optimization Using B+-Tree indexes to support equality/range query Predicate push-down / push-up

User Defined Aggregates Hash based Return multiple rows: ‘early return’ Return multiple columns: employee’s name and salary

33

Implementation of AXL (cont’d) Non-blocking aggregation

Keeping the state of aggregation between calls to the aggregate routines

Local tables defined inside aggregation are passed as parameters to the aggregates

Explicit sorting (and implicit hash-based aggregation)

AXL V1.2: above 30,000 lines of code

34

Check Point Simple applications: AXL UDAs provide

a solution with better performance and good ease of use.

Data Mining applications Formal Semantics of Aggregates

and Monotonic Aggregation

35

Aggregates in Recursion

Stratification: shaves(barber, X) :- villager(X), shaves(X, X). villager(barber). Aggregates:

p count(p) =0 Aggregates in many applications are

actually monotonic (and should be allowed inside recursion).

36

Beyond Stratification Significant previous efforts…

I. S. Mumick, H. Pirahesh and R. Ramakrishnan, “The magic of duplicates and aggregates”, VLDB 1990

A. Van Gelder, “Foundations of aggregates in deductive databases”, DOOD 1993

K. A. Ross and Y. Sagiv, “Monotonic aggregates in deductive databases”, JCSS 1997

S. Greco and C. Zaniolo, “Greedy algorithms in Datalog with choice and negation”, JICSLP 1998

37

Formal Semantics of Aggregates

choice((X),(Y)) Enforcing functional dependency. FD: X->Y Multiplicity of stable models, monotonic transformation

Ordering a domain Once (X,Y) is generated, choice ensures this is the only

arc leaving source node X and entering sink node Y Formal semantics of UDA

…return(Y,V) :- ordered(X,Y), ordered(Y,_), terminate(Y,V).…

38

Early Returns Monotonic Aggregates

AGGREGATE mcount(): INT{ TABLE state(cnt INT) AS VALUES (0); INITIALIZE: ITERATE: { UPDATE state SET cnt=cnt+1; INSERT INTO RETURN SELECT cnt FROM state; } }

Aggregates with only ‘early returns’ and no ‘final returns’ are monotonic w.r.t. set containment:

39

Early Returns Monotonic Aggregates

SELECT mcount(*) FROM employee; v.s.SELECT count(*) FROM employee;

NameJohnMary

Tom

NameJohnMaryTomJerry

mcount{John, Mary, Tom} {1,2,3} count{John, Mary, Tom} 3

mcount{John, Mary, Tom, Jerry} {1,2,3,4} count{John, Mary, Tom, Jerry} 4

40

Return sum at the nth valueAGGREGATE sumat(value INT, n INT): INT

{

TABLE state (sum INT, cnt INT) AS VALUES (0,0);

INITIALIZE: ITERATE: {

UPDATE state SET sum=sum+value, cnt=cnt +1;

INSERT INTO RETURN

SELECT sum FROM state WHERE cnt = n;

}

}

42

Monotonic Aggregation Monotonic aggregates can be used without

any restriction and without changing the underlying implementation.

This solves the problem that had eluded database researchers since the introduction of relational systems: BoM, Company Control, Join-the-Party… Greedy optimization algorithms, such as

Dijkstra’s single source shortest path.

43

Join-the-Party ProblemSome people will come to the party no matter what, and their names are stored in a sure(PName) relation. But many other people will join only after they know that at least K=3 of their friends will be there.

WITH wllcm(Name) AS((SELECT Pname FROM sure)UNION ALL (SELECT f.Pname FROM friend AS f, wllcm AS w WHERE w.Name =f.Fname GROUP BY f.Pname HAVING mcount()=3))SELECT Name FROM wllcm;

Density-based Clustering [M. Ester et al. KDD 96]

44

BoM: Cost of Parts001

002

005

003

006

004

15 1

100

10 52 3

8

Basic Part Cost

005 10006 15

Part ChC

001 4002 1003 2004 3

Part Subpart Qty

001 002 5001 003 1001 004 1001 005 100002 005 10003 006 2003 005 5004 006 3004 005 8004 003 3

assemblyassembly

part-costpart-cost

fan-outfan-out

3

45

BoM: Using AXLWITH cst(part, cost) AS((SELECT part, cost FROM part-cost) UNION ALL (SELECT a.part,

sumat(a.qty * c.cost, p.ChC) FROM assembly AS a, cst AS c, fan-out AS p WHERE a.subpart = c.part AND p.part = a.part GROUP BY a.part))SELECT part, costFROM cst;

Bottom up solution. Computes the cost for each

part once and only once. Monotonic sumat(cost, n)

returns sum when exactly n items are aggregated.

Works in DB2 after AXL rewrites callings of sumat() to callings of automatically generated UDFs.

46

BoM: Using Recursive SQLWITH mpath(subpart, qty) AS

((SELECT subpart, qty

FROM assembly

WHERE part = ‘001’)

UNION ALL

(SELECT c.subpart, m.qty * c.qty

FROM mpath m, assembly c

WHERE m.subpart = c.part))

SELECT sum(m.qty * c.cost)

FROM mpath m, part_cost c

WHERE m.subpart = c.part ;

Top down solution: computes the cost of part ‘001’

Explosion: all edges that descend from part ‘001’ are “stored” in mpath

What if we want to compute the cost of each part?

47

Check Point Simple applications Data Mining and Decision Support Formal Semantics & Monotonic aggregates OLAP and other aggregate extensions

SUCH THAT CUBE, ROLLUP, GROUPING SET OLAP Functions

48

“Such That”

For each division, show the average salary of senior managers who make 3 times more than the average employees, and the average salary of senior engineers who make 2 times more than the average employees (in the same output record).

D. Chatziantoniou, Kenneth Ross, VLDB 1996

SELECT division, avg(X.salary),

avg(Y.salary)

FROM employee

GROUP BY division: X, Y

SUCH THAT X.title = ‘senior manager’ AND X.salary > 3*avg(salary) AND Y.title = ‘senior engineer’ AND Y.salary > 2*avg(salary)

49

Expressing “Such That” in AXL

TABLE seniors(salary INT);

AGGREGATE sscan2(title CHAR(20), salary INT, qtitle CHAR(20), ratio INT): REAL{ TABLE state(sum INT, cnt INT) AS VALUES (0,0); INITIALIZE: ITERATE: {

UPDATE state SET sum=sum+salary, cnt=cnt+1; INSERT INTO seniors VALUES (salary) WHERE title = qtitle;

} TERMINATE: { SELECT avg(s.salary) FROM seniors AS s WHERE s.salary > ratio * (SELECT sum/cnt FROM state); }}

50

Using UDA sscan2 SELECTdivision,

sscan2(title, salary, ‘senior manager’, 3),sscan2(title, salary, ‘senior engineer’, 2)

FROM employeeGROUP BY division;

No joins or sub-queries required. One pass through the employee relation (standard

SQL requires at least 2 passes).

51

Other Aggregate Extensions GROUPING SETS, ROLL-UP, CUBE New OLAP extensions

Windows containing a partitioning, an ordering of rows and an aggregate group

“… every standard must be prepared to tackle new issues that arise as the market evolves. If SQL does not respond positively to this challenge, SQL risks becoming irrelevant …” -- F. Zemke, K. Kulkarni, A. Witkowski, B. Lyle Introduction to OLAP Functions

56

Contributions

Adding extended UDAs to O-R systems high level language, minimal additions to SQL monotonic aggregates, recursive aggregates designed for general purpose applications

Tightly couple data mining functions with DBMS SPRINT Algorithm, Categorical Classifier, …

Performance More efficient than cursor-based languages like

PL/SQL, JDBC and UDF-based approaches …

57

Future Directions Parallelization Extenders/Data Blades

build on top of UDAs instead of UDFs Decision Support

the Apriori algorithm Windows and OLAP functions

Spatial/Temporal extensions the TENOR system

58

Future Direction: Parallelization

Current parallel aggregation algorithms valid for AXL: Inter-Partition parallelism: all tuples of the same group-by

value are in one node Two phase algorithm: user provides a COMBINE routine

Unlike SQL3, AXL’s aggregate routines are written in SQL, so we can apply traditional query parallelization techniques to INITIALIZE, ITERATE, and TERMINATE

Since aggregate routines are written in SQL, the COMBINE routine can be generated automatically by the system for simple UDAs

Documents

User-Defined Aggregates for Advanced Database Applications Haixun Wang [email protected] Computer Science Dept. University of California, Los Angeles