
Mining: A Database Perspective

Raghu Ramakrishnan, Univ. of Wisconsin-Madison

THE EDAM PROJECT, University of Wisconsin-Madison

Data Mining

- Classification: decision trees, regression, SVMs, Naïve Bayes, meta-learners, ensembles
- Clustering: K-means, hierarchical methods, EM
- MRDM/ILP pattern discovery: Horn rules; PRMs
- Frequent item analysis: associations, sequential patterns
- Time-series analysis: linear and nonlinear dynamics
- Collaborative filtering
- Text, multimedia mining

Contributing disciplines: ML/AI, Optimization, DB, Stats

Mining at a Crossroads

- Data mining has drawn upon ideas and people from many disciplines, and has grown rapidly.
- As yet, there is no unifying vision of how these disciplines leverage each other.
  - Stats folks still do stats, ML folks still do ML, DB folks still think about large datasets, and they rarely talk to each other.
- What are the applications that will pay the piper?

About this Talk

- A database perspective on data mining and its relationship to data management
  - How can database-oriented thinking influence research and practice in data mining?
  - What are the difficult problems with big payoffs?
- The EDAM project at Wisconsin
  - Analyzing streams of mass spectra and other spatio-temporal data
  - Joint work with researchers in atmospheric aerosols and climatology at UW-Madison and Carleton College, funded by an NSF ITR

Outline

- A database perspective
- Recent extensions to relational systems
  - OLAP: cube, sequence queries
  - Data mining support
- Relational approaches to mining
  - Relational clustering
  - MRDM/ILP
- The EDAM project


A Database Perspective

All the World’s a Table

- All data is in a database.
  - If not, it’s not important.
- Data mining is a class of analysis techniques that complements current SQL data analysis capabilities.
  - Data is in a DBMS for reasons that go well beyond the analysis capabilities of the DBMS, even if these are often inadequate.
- And if the past is any indication, the DB vendors will try to expand SQL to support whatever DM capabilities the market will pay for, and it’s not clear that this is the right architecture.

Scalability

- Widely recognized as a characteristic DB concern, and the DB field provides useful techniques to deal with scale.
  - BIRCH: scalable pre-clustering that borrows ideas from B+ trees
  - RainForest: framework for scaling decision tree construction that borrows from hash joins
  - (There are also scalable algorithms based on EM and bootstrapping.)
- However, the focus has been on one aspect of scale:
  - Size of training data
- We also need scalability with respect to other problem dimensions:
  - Size of hypothesis space
  - Rate of data capture and analysis

Queries vs. Mining

- From the point of view of the user, SQL queries are one way to explore and understand the data.
  - But is it “data mining”?
- The various data mining techniques are no more (or less) than alternatives with different capabilities.
- The query framework has some ideas worth borrowing and generalizing:
  - Compositionality: more flexibility, more automation
  - Usability: domain analysts, not tool experts
  - Query optimization

A Different Mindset …

- Sometimes, just looking at the problem from a different perspective may lead to useful reformulations:
  - Frequent itemsets
  - Relational clustering
  - Stream analysis
  - Labeling spectra
  - Subset mining
- “What does a query mean?” vs. “How do I characterize my data?”
  - Hopefully, not mutually exclusive!
- Can raise very different concerns
  - E.g., coverage, accuracy (ML), confidence bounds (Stats) vs. query equivalence, compositionality (DB)
  - Combining multiple sources of information (e.g., multiple tables)

Query Optimization

- Driven by user’s query
  - Goal is to find answers to this query efficiently
- Search space for optimization
  - Defined through equivalences to given query
  - Exploits compositionality!
  - “Goodness” metric is estimated plan cost
- Contrast this with the search spaces typical in, e.g., rule discovery or attribute selection
  - These are data-driven, not query-driven
  - Search space based on hypothesis refinement
  - “Goodness” metric based on coverage of training set

Data Management

- Management
  - Data storage and archival
  - Privacy, sharing, collaboration
- Focus has been on managing data; however:
  - Queries can be stored in the DBMS
  - Views, or tables defined by queries
  - (Ownership, access control, re-optimization, caching)
- We need more support for managing analyses:
  - Managing analyses external to the DBMS
  - Provenance of data and analysis
  - Versioning and collaboration support
  - Support for ongoing analyses: impact of data changes on analyses; monitoring; trend analysis over warehouses; deploying results into operational systems

Data Co-Processor Architecture

[Figure: architecture diagram. Components: Indexer, Miner, Files, Logs, DBMS, RAID storage, Warehouse. Labels: Queries/Searches, Small reads, Large R/W, Periodic offline activity.]

[Figure: Customized Asynchronous Replicas. Labels: Updates, SYNC, SQL Queries, OLAP Queries, Text Queries.]


Recent Extensions of Relational Queries

Star Schema

Fact table: Transactions(timekey, storekey, pkey, promkey, ckey, units, price)

Dimension tables: Time, Store, Customers, Products, Promotions

Multidimensional Analysis

             NY       CA       WI
Industry1    $1000    $2000    $1000
Industry2    $500     $1000    $500
Industry3    $3000    $3000    $3000

[Figure: dimension hierarchies, with the slice Country=“USA”. Product: Industry, Category, Product. Location: Country, State, City. Time: Year, Quarter, Month, Week, Day.]

Slice and Drill-Down

             San Francisco    San Jose    Los Angeles
Category1    $300             $300        $400
Category2    $300             $300        $400
Category3    $100             $800        $100

[Figure: dimension hierarchies, with the slice Industry=“Industry3” and State=“CA”. Product: Industry, Category, Product. Location: Country, State, City. Time: Year, Quarter, Month, Week, Day.]

Comparison with SQL

SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid AND S.locid = L.locid
GROUP BY T.year, L.city

SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid = T.timeid
GROUP BY T.year

SELECT SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid = L.locid
GROUP BY L.city

Visual Intuition: Cube

[Figure: a cube with dimensions Location (SH, SF, LA), Product (Product1 to Product6), and Time (M, T, W, Th, F, S, S). Visible cell values: 20, 30, 20, 15, 10, 50. Example cell: 50 units of Product6 sold on Monday in LA. Roll-ups shown: to week, to category, to state.]

CUBE Operator

- For k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
- CUBE pid, locid, timeid BY SUM Sales
  - Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

SELECT SUM(S.sales)
FROM Sales S
GROUP BY grouping-list
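As a rough illustration of the 2^k roll-ups, the sketch below enumerates every subset of the three dimensions and runs the corresponding GROUP BY against an in-memory SQLite table. The helper name `cube` and the four sample rows are made up for this example. Each roll-up is run as an independent query here, which is exactly the redundancy a real CUBE implementation would try to avoid.

```python
import sqlite3
from itertools import combinations

def cube(conn, table, dims, measure):
    """Run one SUM ... GROUP BY query per subset of dims (2^k queries)."""
    results = {}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            cols = ", ".join(subset)
            if subset:
                sql = f"SELECT {cols}, SUM({measure}) FROM {table} GROUP BY {cols}"
            else:
                # Empty grouping list: a single grand total.
                sql = f"SELECT SUM({measure}) FROM {table}"
            results[subset] = conn.execute(sql).fetchall()
    return results

# Hypothetical sample data: (pid, locid, timeid, sales).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INTEGER, locid INTEGER, timeid INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 30), (2, 2, 2, 40)],
)

rollups = cube(conn, "Sales", ["pid", "locid", "timeid"], "sales")
print(len(rollups))                 # one entry per subset of the dimensions
print(rollups[()])                  # grand total
print(sorted(rollups[("pid",)]))    # totals per product
```

Keying the result by the grouping subset makes the lattice of roll-ups explicit, which is the structure an optimizer could exploit to compute coarser roll-ups from finer ones.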

Observation

- When you need to consider several related or overlapping computations:
  - Think of how to expose this space to the user, and to get user input on what part of the space might be interesting
    - Marketing specialists can use OLAP interfaces to do very complex queries easily
  - Think of how to optimize by exploiting commonality across computations

Querying Sequences

- SQL-92 supports queries over relations.
  - A relation is a (multi)set of records.
  - No ordering of records in a relation!
- Queries involving order are hard or impossible to express, and are typically evaluated inefficiently.
  - Find the weekly moving average of the DJIA.
  - Compute the % change of each stock during ‘97, and then find stocks in the top 5% (those that changed most).
- SQL:1999 supports the concept of windowing, which effectively orders tuples for query purposes.

SRQL (Ramakrishnan et al., SSDBM 98)

- Proposed a sequencing operator as an extension to relational algebra.
- Applied to a table R, with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.

Input table R:

g    s    v
3    4    a
3    6    b
3    6    c
3    9    b
2    1    a
4    3    d

Resulting composite sequence:

ord    g    s    v
1      3    4    a
2      3    6    b
2      3    6    c
3      3    9    b
1      2    1    a
1      4    3    d
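The ord column in the table above can be reproduced with a few lines of Python. This is only a sketch: reading ord as the dense rank of s within each group g is an inference from the example rows, not a statement of SRQL's full semantics, and the function name `sequence` is invented here.

```python
from itertools import groupby

def sequence(rows, group_key, seq_key):
    """Attach an ordinal to each row: dense rank of seq_key within its group."""
    out = []
    # Sort by group, then by sequencing attribute, so groupby sees contiguous runs.
    ordered = sorted(rows, key=lambda r: (group_key(r), seq_key(r)))
    for _, members in groupby(ordered, key=group_key):
        ord_, prev = 0, object()   # sentinel that never equals a real value
        for row in members:
            s = seq_key(row)
            if s != prev:          # ties on s share the same ordinal
                ord_ += 1
                prev = s
            out.append((ord_,) + row)
    return out

# The (g, s, v) rows from the example table.
R = [(3, 4, "a"), (3, 6, "b"), (3, 6, "c"), (3, 9, "b"), (2, 1, "a"), (4, 3, "d")]
for row in sequence(R, group_key=lambda r: r[0], seq_key=lambda r: r[1]):
    print(row)
```

The output is ordered by group, so the rows appear in a different order than in the table, but each row carries the same ordinal.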

Example

- Find the 2-day moving average of volume sold for each product:

SELECT product, day, AVG(vol) OVER 0 TO 1
FROM Sales
GROUP BY product
SEQUENCE BY day

- In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences.
- Observe how this generalizes SQL’s GROUP BY: it illustrates the power of composite sequences and aggregation.
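A plain-Python reading of the moving-average query may make the group-then-sequence idea concrete. Two assumptions are baked into this sketch: the window offsets `lo` and `hi` are taken as positions relative to the current row (0 TO 1 meaning the current day and the next one), and windows are clipped at the ends of each sequence. The deck does not settle either point, and the sample rows are invented.

```python
from collections import defaultdict

def moving_average(sales, lo=0, hi=1):
    """Per-product moving average of vol over window positions [i+lo, i+hi].

    Assumes lo <= 0 <= hi so that every window contains the current row.
    """
    by_product = defaultdict(list)
    for product, day, vol in sales:
        by_product[product].append((day, vol))        # GROUP BY product
    result = []
    for product, seq in by_product.items():
        seq.sort()                                     # SEQUENCE BY day
        for i, (day, _) in enumerate(seq):
            # Clip the window at the ends of the sequence.
            window = seq[max(0, i + lo): min(len(seq), i + hi + 1)]
            avg = sum(v for _, v in window) / len(window)
            result.append((product, day, avg))
    return result

# Hypothetical rows: (product, day, vol).
sales = [("tv", 1, 10), ("tv", 2, 20), ("tv", 3, 60), ("ham", 1, 4), ("ham", 2, 8)]
for row in moving_average(sales):
    print(row)
```

Calling `moving_average(sales, lo=-1, hi=0)` gives the backward-looking reading instead.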

Variants of Aggregation

- We can now introduce “running sum” and other cumulative aggregate functions!
  - OVER FIRST TO 0: This gives us “running” or “cumulative” aggregates.
  - RANK() is CUMULATIVE COUNT(*)
  - PERCENTILE() is (RANK()/COUNT(*))*100
- Elegant way to express concepts like “give me the first few answers”.
- SQL:1999 does all this and more (different syntax)
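The cumulative aggregates can be mimicked on an already-sequenced list. This is a sketch of the definitions as stated on the slide, with an invented function name; ties in the sequencing attribute are ignored for simplicity.

```python
from itertools import accumulate

def cumulative(values):
    """Running sum, rank and percentile over an already-sequenced list."""
    n = len(values)
    running_sum = list(accumulate(values))        # SUM(...) OVER FIRST TO 0
    rank = list(range(1, n + 1))                  # cumulative COUNT(*)
    percentile = [r / n * 100 for r in rank]      # (RANK()/COUNT(*))*100
    return list(zip(values, running_sum, rank, percentile))

rows = cumulative([5, 3, 8, 4])
for row in rows:
    print(row)

# "Give me the first few answers": keep rows whose rank is at most 2.
first_two = [r for r in rows if r[2] <= 2]
print(first_two)
```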


Observation

Still much more limited than time-series analysis and mining techniques available elsewhere

No support for streams


DBMS Support for Managing Mining Models

Why Integrate?

[Figure: data flow with the steps Extract, Copy, and Mine leading from Data to Models, annotated with the question “Consistency?”]

Integration Objectives

Analysts (users):
- Avoid isolation of querying from mining
  - Difficult to do “ad-hoc” mining
- Provide simple programming approach to creating and using DM models

DM Vendors:
- Make it possible to add new models
- Make it possible to add new, scalable algorithms

DM Concepts to Support

- Representation of input (cases)
- Representation of models
- Specification of training step
- Specification of prediction step

Should be independent of specific algorithms

Types of Columns

Cust ID    Age    Marital Status    Wealth     Product Purchases
                                               Product    Quantity    Type
1          35     M                 380,000    TV         1           Appliance
                                               Coke       6           Drink
                                               Ham        3           Food

Single case!

- Keys: Columns that uniquely identify a case
- Attributes: Columns that describe a case
  - Value: A state associated with the attribute in a specific case
  - Attribute Property: Columns that describe an attribute
    - Unique for a specific attribute value (TV is always an appliance)
  - Attribute Modifier: Columns that represent additional “meta” information for an attribute
    - Weight of a case, certainty of prediction

Representing a DMM

- Specifying a model
  - Columns it should predict
  - Algorithm to use
  - Special parameters
- Model is represented as a nested table
  - Specification = Create table
  - Training = Inserting data into the table
  - Predicting = Querying the table

Training a DMM

- Training a DMM requires passing it “known” cases
- Use an INSERT INTO in order to “insert” the data into the DMM
  - The DMM will usually not retain the inserted data
  - Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model)

INSERT [INTO] <mining model name>[(columns list)]<source data query>

Making Predictions

SELECT [Customers].[ID], MyDMM.[Hair Color], PredictProbability(MyDMM.[Hair Color])
FROM MyDMM PREDICTION JOIN [Customers]
ON MyDMM.[Gender] = [Customers].[Gender] AND
   MyDMM.[Age] = [Customers].[Age]
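To make the create / insert / query life cycle tangible, here is a toy stand-in written in Python. It is not the API of any product: `ToyMiningModel`, its methods, and the training rows are hypothetical, and the "algorithm" is just a frequency count per combination of input values. The point is only that training summarizes the inserted cases without keeping them, and that prediction behaves like a join on the input columns.

```python
from collections import Counter, defaultdict

class ToyMiningModel:
    """Hypothetical stand-in for a DMM: create, then insert cases, then query."""

    def __init__(self, input_columns, predict_column):
        # "Specification = create table": declare inputs and the column to predict.
        self.input_columns = input_columns
        self.predict_column = predict_column
        self.counts = defaultdict(Counter)

    def insert(self, cases):
        # "Training = inserting data": cases are summarized, not retained.
        for case in cases:
            key = tuple(case[c] for c in self.input_columns)
            self.counts[key][case[self.predict_column]] += 1

    def prediction_join(self, rows):
        # "Predicting = querying": match each row to the model on the input columns.
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            seen = self.counts.get(key)
            if not seen:
                yield row, None, 0.0     # no training case matched these inputs
                continue
            value, n = seen.most_common(1)[0]
            yield row, value, n / sum(seen.values())

model = ToyMiningModel(["Gender", "Age"], "Hair Color")
model.insert([
    {"Gender": "F", "Age": 30, "Hair Color": "Black"},
    {"Gender": "F", "Age": 30, "Hair Color": "Black"},
    {"Gender": "F", "Age": 30, "Hair Color": "Brown"},
    {"Gender": "M", "Age": 45, "Hair Color": "Gray"},
])
customers = [{"ID": 1, "Gender": "F", "Age": 30}, {"ID": 2, "Gender": "M", "Age": 60}]
for row, value, prob in model.prediction_join(customers):
    print(row["ID"], value, prob)
```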

Research Directions: MRDM/ILP

MRDM Accomplishments

- ILP origins, hypothesis discovery
- Classification
- Clustering
- Frequent itemsets
- Equational discovery
- Subgroup discovery
- Extensions of Bayesian nets to multiple relations via key-foreign key traversals

Issues

- Can we indeed capture the semantics exactly for each of these classes of patterns/models?
  - Taking into account the details of the underlying evaluation algorithm!
- Is the performance comparable to specialized algorithms?
  - Is it acceptable for a broad range of applications?

Positives

- Impressive! Quite a range of patterns/models are shown to be expressible in this formalism
  - Importantly, the added expressiveness allows new kinds of patterns to be naturally formulated by a user
- There is a (more or less) common computational structure consisting of
  - Space of patterns to search
  - Measure of support for a pattern
  - Enumeration and pruning strategy over search space
- What tangible benefits can we derive from this generality?

Challenges, Opportunities

- If ILP notation is roughly analogous to relational calculus, what is the appropriate algebra?
  - Equivalences, compositionality
  - Cost-based optimization to find “optimal” evaluation plans
- What kind of user input/domain knowledge can be used to focus computation, or help with optimization?

Research Directions: Relational Clustering

Problem Statement

- Goal: Discover clusters of attribute-values
- Data: A table T with attributes drawn from domains D1,…,Dn
  - Thus, a tuple of T consists of a value from each domain, e.g., (a1,b2,c1)
  - T could be an arbitrary view over several tables!
- Note: We expect sizes of D1,…,Dn to be small

[Figure: attribute values a1 to a4, b1 to b3, and c1 to c4, arranged in columns A, B, C.]

STIRR (Gibson, Kleinberg, Raghavan, VLDB 98)

- Intuition: Want to detect that “Honda and Toyota are related because unusually high numbers of both were sold in August.”
  - If we also find that many Hondas and Nissans are sold in Sept, and many dealers sell both Hondas and Acuras, this leads to a cluster best described as “late-summer sales of Japanese cars”
- Approach: Techniques for spectral graph partitioning, generalized to hypergraphs.
  - Attribute values as weighted vertices in a graph; edges based on co-occurrence.
  - Weights propagate along links, leading to a non-linear dynamical system.
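The propagation step can be sketched in a few lines, loosely in the spirit of the approach described above. Several choices here are assumptions of the sketch rather than details taken from the slide: the combiner is a plain sum (which makes this particular instance linear; a nonlinear combiner could be swapped in), weights are renormalized per attribute after every round, and tuples are assumed to have at least two attributes. Only the propagation is shown, not how clusters are read off the resulting weights.

```python
import math
from collections import defaultdict

def propagate(tuples, iterations=20):
    """Iterate weight propagation over (attribute index, value) nodes."""
    nodes = {(i, v) for t in tuples for i, v in enumerate(t)}
    w = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_w = defaultdict(float)
        for t in tuples:
            members = list(enumerate(t))
            for node in members:
                # Combine the weights of the other values in this tuple (here: sum).
                new_w[node] += sum(w[m] for m in members if m != node)
        # Normalize per attribute so the squared weights sum to 1.
        norms = defaultdict(float)
        for (i, _), x in new_w.items():
            norms[i] += x * x
        w = {n: new_w[n] / math.sqrt(norms[n[0]]) for n in nodes}
    return w

# Hypothetical (make, month) sales records.
records = [("Honda", "Aug"), ("Toyota", "Aug"), ("Honda", "Aug"), ("Ford", "Jan")]
weights = propagate(records)
for node in sorted(weights):
    print(node, round(weights[node], 3))
```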

CACTUS (Ganti, Gehrke, Ramakrishnan, KDD 99)

- Same motivation, different problem formulation and approach
- Precise definition of cluster, deterministic algorithm that computes all clusters
- Very efficient, scalable, SQL-based algorithm

Similarity Between Attributes

- “Similarity” between a1 and b1: support(a1,b1) = number of tuples containing (a1,b1)
- a1 and b1 are strongly connected if support(a1,b1) is higher than expected
- {a1,a2,a3,a4} and {b1,b2} are strongly connected if all pairs are

[Figure: attribute values a1 to a4, b1 to b4, and c1 to c4 in columns A, B, C, with links between values; one pair is marked “Not strongly connected”.]
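A small sketch of the support test follows. The slide only says "higher than expected"; taking the expected support to be N / (|Di| * |Dj|), i.e. uniform and independent attribute values, and requiring support to exceed it by a factor `alpha`, are assumptions of this example, as are the function name and the sample tuples. Domain sizes are estimated from the values that actually occur.

```python
from collections import Counter

def strongly_connected_pairs(tuples, i, j, alpha=2.0):
    """Return {(vi, vj): support} for pairs whose support exceeds alpha * expected."""
    support = Counter((t[i], t[j]) for t in tuples)
    size_i = len({t[i] for t in tuples})
    size_j = len({t[j] for t in tuples})
    # Assumed baseline: values are uniform and independent across attributes.
    expected = len(tuples) / (size_i * size_j)
    return {pair: s for pair, s in support.items() if s > alpha * expected}

# Hypothetical two-attribute table with two dense value pairs and some noise.
T = (
    [("a1", "b1")] * 6
    + [("a2", "b2")] * 6
    + [("a1", "b2"), ("a2", "b1"), ("a3", "b3"), ("a3", "b1")]
)
print(strongly_connected_pairs(T, 0, 1))
```

With 16 tuples and 3 values per attribute the expected support is about 1.78, so only the two pairs that occur six times pass the test.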

Similarity Within an Attribute

simA(b1,b2): Number of values of A which are strongly connected with both b1 and b2

sim*(B)    thru A    thru C
(b1,b2)    4         2
(b1,b3)    0         2
(b1,b4)    0         0
(b2,b3)    0         2
(b2,b4)    0         0

[Figure: attribute values a1 to a4, b1 to b4, and c1 to c4 in columns A, B, C.]
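Given the set of strongly connected (A, B) value pairs, the similarity through A is a set intersection per pair of B-values. The sketch below uses a hypothetical list of strong pairs, chosen so that the result echoes the "thru A" column of the table above; the function name is invented.

```python
from itertools import combinations

def similarity_through(strong_pairs, b_values):
    """sim^A(b, b'): number of A-values strongly connected with both b and b'."""
    neighbors = {b: set() for b in b_values}
    for a, b in strong_pairs:
        neighbors[b].add(a)
    return {
        (b1, b2): len(neighbors[b1] & neighbors[b2])
        for b1, b2 in combinations(b_values, 2)
    }

# Hypothetical strong pairs: every a-value is strongly connected with b1 and b2 only.
strong_ab = [(a, b) for a in ("a1", "a2", "a3", "a4") for b in ("b1", "b2")]
sim_a = similarity_through(strong_ab, ["b1", "b2", "b3", "b4"])
for pair, value in sim_a.items():
    print(pair, value)
```

Because the computation only touches the summary of strong pairs, it never needs to go back to the base table.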

Cluster Definition

Region: A cross-product of sets of attribute values: C1 x … x Cn

C = C1 x … x Cn is a cluster iff

1. Ci and Cj are strongly connected, for all i,j

2. Ci is maximal, for all i

3. Support(C) > expected

Ci: cluster projection of C on Ai

The CACTUS Algorithm

- Summarize
  - Inter-attribute summaries: Scan dataset
  - Intra-attribute summaries: Query IA summaries
- Clustering phase
  - Compute cluster projections
  - Level-wise synthesis of cluster projections to form candidate clusters
- Validation
  - Requires a scan of the dataset

Inter-Attribute Summaries

- Supports of all strongly connected attribute value pairs from different attributes
- Similar in nature to “frequent” 2-itemsets
  - So is the computation

IJ(A,B)    IJ(A,C)    IJ(B,C)
(a1,b1)    (a1,c1)    (b1,c1)
(a1,b2)    (a1,c2)    (b1,c2)
(a2,b1)    (a2,c1)    (b2,c1)
(a2,b2)    (a2,c2)    (b2,c2)
(a3,b1)               (b3,c1)
…                     …

[Figure: attribute values a1 to a4, b1 to b4, and c1 to c4 in columns A, B, C.]

Intra-Attribute Summaries

simA(B): Similarities through A of attribute value pairs of B

sim*(B)    thru A    thru C
(b1,b2)    4         2
(b1,b3)    0         2
(b1,b4)    0         0
(b2,b3)    0         2
(b2,b4)    0         0

[Figure: attribute values a1 to a4, b1 to b4, and c1 to c4 in columns A, B, C.]

Experimental Evaluation

- Compare CACTUS with STIRR [GKR98]
- Synthetic datasets
  - Quasi-random data [GKR98: STIRR]
    - Fix domain of each attribute
    - Randomly generate tuples from these domains
    - Identify clusters and plant additional (5%) data within the clusters
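The data-generation recipe above might look roughly like this in code. It is a sketch under assumptions the slide leaves open: the extra 5% is taken relative to the number of random tuples and is split at random between the planted clusters, and all names and parameter defaults are made up.

```python
import random

def make_dataset(n_tuples, n_attrs, domain_size, clusters, extra=0.05, seed=0):
    """Uniform random tuples plus extra * n_tuples tuples drawn inside planted clusters."""
    rng = random.Random(seed)
    data = [
        tuple(rng.randrange(domain_size) for _ in range(n_attrs))
        for _ in range(n_tuples)
    ]
    n_planted = round(extra * n_tuples)
    for _ in range(n_planted):
        region = rng.choice(clusters)            # one value range per attribute
        data.append(tuple(rng.choice(values) for values in region))
    rng.shuffle(data)
    return data

# Two planted regions over a pair of attributes, each with domain 0..99.
clusters = [
    (range(0, 10), range(0, 10)),
    (range(10, 20), range(10, 20)),
]
data = make_dataset(n_tuples=10_000, n_attrs=2, domain_size=100, clusters=clusters)
print(len(data))
print(sum(1 for a, b in data if a < 10 and b < 10))   # tuples inside the first region
```

Fixing the seed keeps a run reproducible, which matters when two algorithms are compared on the same generated data.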

Synthetic Datasets

Planted clusters:
{0,…,9} x {0,…,9}
{10,…,19} x {10,…,19}

[Figure: two attributes with values 0 to 99; the planted clusters cover the value ranges 0 to 9 and 10 to 19.]

Both CACTUS and STIRR identified the two clusters exactly

Synthetic Dataset (contd.)

Planted clusters:
{0,…,9} x {0,…,9} x {0,…,9}
{10,…,19} x {10,…,19} x {10,…,19}
{0,…,9} x {10,…,19} x {10,…,19}

CACTUS identifies the 3 clusters

STIRR returns:
{0,…,9} x {0,…,19} x {0,…,9}
{10,…,19} x {0,…,19} x {10,…,19}

[Figure: three attributes with values 0 to 99; the planted clusters cover the value ranges 0 to 9 and 10 to 19.]

Scalability with #Tuples

[Figure: Time vs. #Tuples. X-axis: #Tuples (in millions), 1 to 5. Y-axis: Time (in seconds), 0 to 2500. Series: CACTUS, STIRR. Settings: #Attributes: 10, Domain Size: 100.]

CACTUS is 10 times faster

Scalability with #Attributes

[Figure: Time vs. #Attributes. X-axis: #Attributes, 4 to 50. Y-axis: Time (in seconds), 0 to 5000. Series: CACTUS, STIRR. Settings: 1 million tuples, Domain size: 100.]

Scalability with Domain Size

[Figure: Time vs. Domain Size. X-axis: #Attribute Values, 50 to 1000. Y-axis: Time (in seconds), 0 to 250. Series: CACTUS, STIRR. Settings: 1 million tuples, #attributes: 4.]

Bibliographic Data

- Database and theory bibliographic entries [Wie]: 38500 entries
- Attributes: first author, second author, conference/journal, and year
- Example cluster projections on the conference attribute:

(1). ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
(2). ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
(3). PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …


ROCK (Guha, Rastogi, Shim, ICDE 99)

Each tuple is a node, and two nodes are linked if within a threshold distance.

Similarity between two nodes is the number of common neighbors.

ROCK does agglomerative hierarchical clustering based on similarity.
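The neighbor and link computation can be sketched as follows. Using Jaccard similarity between sets of items as the closeness measure, treating "within a threshold distance" as similarity of at least `theta`, and not counting a record as its own neighbor are all assumptions of this example; the agglomerative merging step is not shown, and the baskets are invented.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def link_counts(records, theta):
    """links[(i, j)] = number of common neighbors of records i and j."""
    n = len(records)
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if jaccard(records[i], records[j]) >= theta:   # close enough to be neighbors
            neighbors[i].add(j)
            neighbors[j].add(i)
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i, j in combinations(range(n), 2)
    }

# Hypothetical market baskets.
baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}, {"x", "y", "z"}]
links = link_counts(baskets, theta=0.5)
print(links[(0, 1)], links[(0, 3)])
```

Records 0, 1 and 2 are pairwise neighbors, so each pair among them shares exactly one common neighbor, while records from the two unrelated groups share none.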

Research Directions: The EDAM Project

Example Tasks

- Label a spectrum to identify elements
- Find common elements across (subsets of) spectra
  - Collected at multiple locations, and multiple conditions, and …
  - At different times, and over time periods
- Find subsets of spectra (e.g., based on time periods and locations) with
  - Unusually common elements
  - Interesting characteristics
  - Correlations to other spectral streams
- Want to be able to reconstruct analysis done a year ago and run it on different data
- Want to share ongoing analysis with colleagues and track changes and their impact


[Slides omitted from this version]

Conclusions

- Database systems hold a lot of the data people care about and want to mine, making them an important part of the mining environment
  - Especially for ongoing analysis and collaboration
- Beyond this, there are a number of ideas and techniques in the DB literature that can be applied more broadly
  - Formulations of mining tasks
  - Algorithms
- Scalability is an important idea from databases
  - But there are many more: compositionality, query-driven approach, set-oriented analyses