Mining: A Database Perspective
Raghu Ramakrishnan, Univ. of Wisconsin-Madison
THE EDAM PROJECT University of Wisconsin-Madison
Data Mining
- Classification: decision trees, regression, SVMs, Naïve Bayes, meta-learners, ensembles
- Clustering: K-means, hierarchical methods, EM
- MRDM/ILP pattern discovery: Horn rules; PRMs
- Frequent item analysis: associations, sequential patterns
- Time-series analysis: linear and nonlinear dynamics
- Collaborative filtering
- Text, multimedia mining
[Diagram labels: ML/AI, Optimization, DB, Stats]
Mining at a Crossroads
- Data mining has drawn upon ideas and people from many disciplines, and has grown rapidly.
- As yet, there is no unifying vision of how these disciplines leverage each other.
  - Stats folks still do stats, ML folks still do ML, and DB folks still think about large datasets; they rarely talk to each other.
- What are the applications that will pay the piper?
About this Talk
- A database perspective on data mining and its relationship to data management
  - How can database-oriented thinking influence research and practice in data mining?
  - What are the difficult problems with big payoffs?
- The EDAM project at Wisconsin
  - Analyzing streams of mass spectra and other spatio-temporal data
  - Joint work with researchers in atmospheric aerosols and climatology at UW-Madison and Carleton College, funded by an NSF ITR
Outline
- A database perspective
- Recent extensions to relational systems
  - OLAP: cube, sequence queries
  - Data mining support
- Relational approaches to mining
  - Relational clustering
  - MRDM/ILP
- The EDAM project
A Database Perspective
All the World’s a Table
- All data is in a database.
  - If not, it’s not important
- Data mining is a class of analysis techniques that complements current SQL data analysis capabilities.
  - Data is in a DBMS for reasons that go well beyond the analysis capabilities of the DBMS, even if these are often inadequate.
  - And if the past is any indication, the DB vendors will try to expand SQL to support whatever DM capabilities the market will pay for, and it’s not clear that this is the right architecture.
Scalability
- Widely recognized as a characteristic DB concern, and one for which the DB field provides useful techniques.
  - BIRCH: scalable pre-clustering that borrows ideas from B+ trees
  - RainForest: framework for scaling decision tree construction that borrows from hash joins
  - (There are also scalable algorithms based on EM and bootstrapping)
- However, the focus has been on one aspect of scale:
  - Size of training data
- We also need scalability with respect to other problem dimensions:
  - Size of hypothesis space
  - Rate of data capture and analysis
Queries vs. Mining
- From the point of view of the user, SQL queries are one way to explore and understand the data.
  - But is it “data mining”?
  - The various data mining techniques are no more (or less) than alternatives with different capabilities.
- The query framework has some ideas worth borrowing and generalizing:
  - Compositionality: more flexibility, more automation
  - Usability: domain analysts, not tool experts
  - Query optimization
A Different Mindset …
- Sometimes, just looking at the problem from a different perspective may lead to useful reformulations:
  - Frequent itemsets
  - Relational clustering
  - Stream analysis
  - Labeling spectra
  - Subset mining
- “What does a query mean?” vs. “How do I characterize my data?”
  - Hopefully, not mutually exclusive!
- Can raise very different concerns
  - E.g., coverage, accuracy (ML), confidence bounds (Stats) vs. query equivalence, compositionality (DB)
  - Combining multiple sources of information (e.g., multiple tables)
Query Optimization
- Driven by user’s query
  - Goal is to find answers to this query efficiently
- Search space for optimization
  - Defined through equivalences to given query
  - Exploits compositionality!
- “Goodness” metric is estimated plan cost
- Contrast this with the search spaces typical in, e.g., rule discovery or attribute selection
  - These are data-driven, not query-driven
  - Search space based on hypothesis refinement
  - “Goodness” metric based on coverage of training set
Data Management
- Management
  - Data storage and archival
  - Privacy, sharing, collaboration
- Focus has been on managing data; however:
  - Queries can be stored in the DBMS
  - Views, or tables defined by queries
  - (Ownership, access control, re-optimization, caching)
- We need more support for managing analyses:
  - Managing analyses external to the DBMS
  - Provenance of data and analysis
  - Versioning and collaboration support
  - Support for ongoing analyses: impact of data changes on analyses; monitoring; trend analysis over warehouses; deploying results into operational system
Data Co-Processor Architecture
[Diagram labels: Indexer; Miner; Files, Logs; DBMS; Warehouse; RAID storage; small reads; large R/W; periodic offline activity; queries/searches]
Customized Asynchronous Replicas
[Diagram labels: updates; SQL queries; OLAP queries; text queries; SYNC]
Recent Extensions of Relational Queries
Star Schema
- Fact table: Transactions(timekey, storekey, pkey, promkey, ckey, units, price)
- Dimension tables: Time, Store, Customers, Products, Promotions
Multidimensional Analysis
            NY      CA      WI
Industry1   $1000   $2000   $1000
Industry2   $500    $1000   $500
Industry3   $3000   $3000   $3000
Dimension hierarchies (from the diagram):
- Product: Industry, Category, Product
- Location: Country=“USA”, State, City
- Time: Year, Quarter, Month, Week, Day
Slice and Drill-Down
            San Francisco   San Jose   Los Angeles
Category1   $300            $300       $400
Category2   $300            $300       $400
Category3   $100            $800       $100
Dimension hierarchies (from the diagram):
- Product: Industry=“Industry3”, Category, Product
- Location: Country, State=“CA”, City
- Time: Year, Quarter, Month, Week, Day
Comparison with SQL
    SELECT SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid=T.timeid AND S.locid=L.locid
    GROUP BY T.year, L.city

    SELECT SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeid=T.timeid
    GROUP BY T.year

    SELECT SUM(S.sales)
    FROM Sales S, Locations L
    WHERE S.locid=L.locid
    GROUP BY L.city
Visual Intuition: Cube
[Figure: a data cube with axes Location (SH, SF, LA), Product (Product1 to Product6), and Time (M T W Th F S S). Example cell: 50 units of Product6 sold on Monday in LA. Arrows show roll-up to week, roll-up to category, and roll-up to state.]
CUBE Operator
- For k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
- CUBE pid, locid, timeid BY SUM Sales
  - Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:
    SELECT SUM(S.sales)
    FROM Sales S
    GROUP BY grouping-list
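The eight roll-ups can be enumerated mechanically. The following is a minimal sketch, assuming an in-memory SQLite table named Sales with columns pid, locid, timeid, and sales; the helper name `cube` and the sample rows are illustrative and not part of the talk. It issues one GROUP BY query per subset of the dimensions, which is the naive evaluation that a real CUBE implementation would try to improve on by sharing work.

```python
import sqlite3
from itertools import combinations

def cube(conn, dims, measure="sales", table="Sales"):
    """Run one SUM ... GROUP BY query per subset of dims; return {subset: rows}."""
    results = {}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            cols = ", ".join(subset)
            select = f"{cols}, " if subset else ""
            group_by = f" GROUP BY {cols}" if subset else ""
            sql = f"SELECT {select}SUM({measure}) FROM {table}{group_by}"
            results[subset] = conn.execute(sql).fetchall()
    return results

# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INTEGER, locid INTEGER, timeid INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 30), (2, 2, 2, 40)])
rollups = cube(conn, ["pid", "locid", "timeid"])
```

The empty subset corresponds to the grand total, and the full subset to the finest-grained grouping.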
Observation
- When you need to consider several related or overlapping computations
  - Think of how to expose this space to the user, and to get user input on what part of the space might be interesting
    - Marketing specialists can use OLAP interfaces to do very complex queries easily
  - Think of how to optimize by exploiting commonality across computations
Querying Sequences
- SQL-92 supports queries over relations.
  - A relation is a (multi)set of records.
  - No ordering of records in a relation!
- Queries involving order are hard or impossible to express, and are typically evaluated inefficiently.
  - Find the weekly moving average of the DJIA.
  - Compute the % change of each stock during ‘97, and then find stocks in the top 5% (those that changed most).
- SQL:1999 supports the concept of windowing, which effectively orders tuples for query purposes.
SRQL (Ramakrishnan et al., SSDBM 98)
- Proposed a sequencing operator as an extension to relational algebra.
- Applied to a table R, with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.
Input table R:
g   s   v
3   4   a
3   6   b
3   6   c
3   9   b
2   1   a
4   3   d
Resulting composite sequence:
ord   g   s   v
1     3   4   a
2     3   6   b
2     3   6   c
3     3   9   b
1     2   1   a
1     4   3   d
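The example tables suggest that positions restart within each group and that rows with equal sequencing values share a position. Here is a minimal sketch of an operator with that behavior, reconstructed from the example only; the function name `sequence` is hypothetical and this is not SRQL's actual implementation.

```python
from itertools import groupby

def sequence(rows, group_key, seq_key):
    """Return (ord, row) pairs: ord restarts per group and ties share a position."""
    out = []
    # Keep groups in first-seen order, as in the slide's result table.
    groups = {}
    for row in rows:
        groups.setdefault(group_key(row), []).append(row)
    for members in groups.values():
        members.sort(key=seq_key)  # stable sort by the sequencing attribute
        for position, (_, tied) in enumerate(groupby(members, key=seq_key), start=1):
            out.extend((position, row) for row in tied)
    return out

# The table R from the slide, as (g, s, v) tuples.
R = [(3, 4, "a"), (3, 6, "b"), (3, 6, "c"), (3, 9, "b"), (2, 1, "a"), (4, 3, "d")]
composite = sequence(R, group_key=lambda r: r[0], seq_key=lambda r: r[1])
```

On the slide's input this yields the ord column 1, 2, 2, 3, 1, 1.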
Example
- Find the 2-day moving average of volume sold for each product:
    SELECT product, day, AVG(vol) OVER 0 TO 1
    FROM Sales
    GROUP BY product
    SEQUENCE BY day
  - In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences.
- Observe how this generalizes SQL’s GROUP BY: illustrates power of composite sequences and aggregation.
Variants of Aggregation
- We can now introduce “running sum” and other cumulative aggregate functions!
  - OVER FIRST TO 0: This gives us “running” or “cumulative” aggregates.
  - RANK() is CUMULATIVE COUNT(*)
  - PERCENTILE() is (RANK()/COUNT(*))*100
- Elegant way to express concepts like “give me the first few answers”.
- SQL:1999 does all this and more (different syntax)
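Windowed and cumulative aggregates can both be viewed as one aggregate applied over a range of positions relative to the current one. The sketch below assumes that offset 0 is the current position, that positive offsets point to later positions, and that windows are truncated at the ends of the sequence; the slides do not spell out these details, and `window_agg` is a hypothetical helper rather than SRQL syntax.

```python
def window_agg(values, lo, hi, agg):
    """Apply agg over positions i+lo .. i+hi for each i; lo=None means FIRST."""
    out = []
    for i in range(len(values)):
        start = 0 if lo is None else max(0, i + lo)
        end = min(len(values), i + hi + 1)
        out.append(agg(values[start:end]))
    return out

def mean(xs):
    return sum(xs) / len(xs)

vol = [10, 20, 30, 40]                        # one product's volumes, ordered by day
moving_avg = window_agg(vol, 0, 1, mean)      # OVER 0 TO 1
running_sum = window_agg(vol, None, 0, sum)   # OVER FIRST TO 0
rank = window_agg(vol, None, 0, len)          # cumulative COUNT(*)
percentile = [r / len(vol) * 100 for r in rank]
```

Running this per product, after ordering each product's rows by day, corresponds to the GROUP BY ... SEQUENCE BY form.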
Observation
Still much more limited than time-series analysis and mining techniques available elsewhere
No support for streams
DBMS Support for Managing Mining Models
Why Integrate?
[Diagram labels: Data; Copy; Extract; Mine; Models; Consistency?]
Integration Objectives
Analysts (users):
- Avoid isolation of querying from mining
  - Difficult to do “ad-hoc” mining
- Provide simple programming approach to creating and using DM models
DM Vendors:
- Make it possible to add new models
- Make it possible to add new, scalable algorithms
DM Concepts to Support
- Representation of input (cases)
- Representation of models
- Specification of training step
- Specification of prediction step
Should be independent of specific algorithms
Types of Columns
A single case, with a nested Product Purchases table:
Cust ID   Age   Marital Status   Wealth    Product Purchases (Product, Quantity, Type)
1         35    M                380,000   TV, 1, Appliance
                                           Coke, 6, Drink
                                           Ham, 3, Food
- Keys: Columns that uniquely identify a case
- Attributes: Columns that describe a case
  - Value: A state associated with the attribute in a specific case
  - Attribute Property: Columns that describe an attribute
    - Unique for a specific attribute value (TV is always an appliance)
  - Attribute Modifier: Columns that represent additional “meta” information for an attribute
    - Weight of a case, Certainty of prediction
Representing a DMM
- Specifying a Model
  - Columns it should predict
  - Algorithm to use
  - Special parameters
- Model is represented as a nested table
  - Specification = Create table
  - Training = Inserting data into the table
  - Predicting = Querying the table
Training a DMM
- Training a DMM requires passing it “known” cases
- Use an INSERT INTO in order to “insert” the data to the DMM
  - The DMM will usually not retain the inserted data
  - Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model)
    INSERT [INTO] <mining model name>
    [(columns list)]
    <source data query>
Making Predictions
    SELECT [Customers].[ID], MyDMM.[Hair Color], PredictProbability(MyDMM.[Hair Color])
    FROM MyDMM PREDICTION JOIN [Customers]
    ON MyDMM.[Gender] = [Customers].[Gender] AND
       MyDMM.[Age] = [Customers].[Age]
Research Directions: MRDM/ILP
MRDM Accomplishments
- ILP origins, hypothesis discovery
- Classification
- Clustering
- Frequent itemsets
- Equational discovery
- Subgroup discovery
- Extensions of Bayesian nets to multiple relations via key-foreign key traversals
Issues
- Can we indeed capture the semantics exactly for each of these classes of patterns/models?
  - Taking into account the details of the underlying evaluation algorithm!
- Is the performance comparable to specialized algorithms?
  - Is it acceptable for a broad range of applications?
Positives
- Impressive! Quite a range of patterns/models are shown to be expressible in this formalism
  - Importantly, the added expressiveness allows new kinds of patterns to be naturally formulated by a user
- There is a (more or less) common computational structure consisting of
  - Space of patterns to search
  - Measure of support for a pattern
  - Enumeration and pruning strategy over search space
- What tangible benefits can we derive from this generality?
Challenges, Opportunities
- If ILP notation is roughly analogous to relational calculus, what is the appropriate algebra?
  - Equivalences, compositionality
  - Cost-based optimization to find “optimal” evaluation plans
- What kind of user input/domain knowledge can be used to focus computation, or help with optimization?
Research Directions: Relational Clustering
Problem Statement
- Goal: Discover clusters of attribute-values
- Data: A table T with attributes drawn from domains D1,…,Dn
  - Thus, a tuple of T consists of a value from each domain, e.g., (a1,b2,c1)
  - T could be an arbitrary view over several tables!
- Note: We expect sizes of D1,…,Dn to be small
[Figure: attribute values a1 to a4 (A), b1 to b3 (B), and c1 to c4 (C)]
STIRR (Gibson, Kleinberg, Raghavan, VLDB 98)
- Intuition: Want to detect that “Honda and Toyota are related because unusually high numbers of both were sold in August.”
  - If we also find that many Hondas and Nissans are sold in Sept, and many dealers sell both Hondas and Acuras, this leads to a cluster best described as “late-summer sales of Japanese cars”
- Approach: Techniques for spectral graph partitioning, generalized to hypergraphs.
  - Attribute values as weighted vertices in a graph; edges based on co-occurrence.
  - Weights propagate along links, leading to a non-linear dynamical system.
CACTUS (Ganti, Gehrke, Ramakrishnan, KDD 99)
- Same motivation, different problem formulation and approach
- Precise definition of cluster, deterministic algorithm that computes all clusters
- Very efficient, scalable, SQL-based algorithm
Similarity Between Attributes
- “Similarity” between a1 and b1: support(a1,b1) = number of tuples containing (a1,b1)
- a1 and b1 are strongly connected if support(a1,b1) is higher than expected
- {a1,a2,a3,a4} and {b1,b2} are strongly connected if all pairs are
[Figure: values a1 to a4 (A), b1 to b4 (B), and c1 to c4 (C); one pair of value sets is marked “Not strongly connected”]
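A minimal sketch of the support test follows. The slide says only "higher than expected"; this sketch assumes that the expected support is what uniform, independent attribute values would give, |T| / (|Di| x |Dj|), and that "higher" means exceeding it by a hypothetical factor `alpha`. The domains are taken to be the values observed in the data, and the function name and sample tuples are illustrative.

```python
from collections import Counter

def strongly_connected_pairs(tuples, i, j, alpha=2.0):
    """Return {(x, y): support} for value pairs of attributes i and j whose
    support exceeds alpha times the support expected under independence."""
    support = Counter((t[i], t[j]) for t in tuples)
    dom_i = {t[i] for t in tuples}
    dom_j = {t[j] for t in tuples}
    # Assumption: values are uniform and independent, so every pair is
    # expected in len(tuples) / (|Di| * |Dj|) tuples.
    expected = len(tuples) / (len(dom_i) * len(dom_j))
    return {pair: n for pair, n in support.items() if n > alpha * expected}

# Hypothetical data: (a1, b1) co-occurs far more often than any other pair.
T = [("a1", "b1")] * 6 + [("a1", "b2"), ("a2", "b1"), ("a2", "b2"),
                          ("a3", "b3"), ("a3", "b1"), ("a3", "b2")]
pairs = strongly_connected_pairs(T, 0, 1)
```

Here 12 tuples over 3 x 3 value pairs give an expected support of about 1.33, so only (a1, b1) with support 6 passes the test.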
Similarity Within an Attribute
- simA(b1,b2): Number of values of A which are strongly connected with both b1 and b2
sim*(B)   thru A   thru C
(b1,b2)   4        2
(b1,b3)   0        2
(b1,b4)   0        0
(b2,b3)   0        2
(b2,b4)   0        0
[Figure: values a1 to a4 (A), b1 to b4 (B), and c1 to c4 (C)]
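The table can be reproduced from the strongly connected pairs alone. In this sketch the pair lists are assumptions read off the figure: every value of A is connected to b1 and b2, and c1 and c2 are each connected to b1, b2, and b3. The function name `similarity_through` is hypothetical.

```python
from itertools import combinations

def similarity_through(pairs):
    """pairs: strongly connected (x, y) pairs, x from one attribute, y from B.
    Returns {(y1, y2): number of x values connected to both y1 and y2}."""
    connected = {}
    for x, y in pairs:
        connected.setdefault(y, set()).add(x)
    sims = {}
    for y1, y2 in combinations(sorted(connected), 2):
        common = len(connected[y1] & connected[y2])
        if common:
            sims[(y1, y2)] = common
    return sims

# Assumed strongly connected pairs (not stated in full on the slide).
ij_ab = [(a, b) for a in ("a1", "a2", "a3", "a4") for b in ("b1", "b2")]
ij_cb = [(c, b) for c in ("c1", "c2") for b in ("b1", "b2", "b3")]
sim_thru_a = similarity_through(ij_ab)
sim_thru_c = similarity_through(ij_cb)
```

Pairs with no common connected value, such as (b1, b4), are simply absent from the result, which matches the zeros in the table.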
Cluster Definition
- Region: A cross-product of sets of attribute values: C1 x … x Cn
- C = C1 x … x Cn is a cluster iff
  1. Ci and Cj are strongly connected, for all i ≠ j
  2. Ci is maximal, for all i
  3. Support(C) > expected
- Ci: cluster projection of C on Ai
The CACTUS Algorithm
- Summarize
  - Inter-attribute summaries: Scan dataset
  - Intra-attribute summaries: Query IA summaries
- Clustering phase
  - Compute cluster projections
  - Level-wise synthesis of cluster projections to form candidate clusters
- Validation
  - Requires a scan of the dataset
Inter-Attribute Summaries
- Supports of all strongly connected attribute value pairs from different attributes
- Similar in nature to “frequent” 2-itemsets
  - So is the computation
IJ(A,B)   IJ(A,C)   IJ(B,C)
(a1,b1)   (a1,c1)   (b1,c1)
(a1,b2)   (a1,c2)   (b1,c2)
(a2,b1)   (a2,c1)   (b2,c1)
(a2,b2)   (a2,c2)   (b2,c2)
(a3,b1)             (b3,c1)
…                   …
[Figure: values a1 to a4 (A), b1 to b4 (B), and c1 to c4 (C)]
Intra-Attribute Summaries
- simA(B): Similarities through A of attribute value pairs of B
sim*(B)   thru A   thru C
(b1,b2)   4        2
(b1,b3)   0        2
(b1,b4)   0        0
(b2,b3)   0        2
(b2,b4)   0        0
[Figure: values a1 to a4 (A), b1 to b4 (B), and c1 to c4 (C)]
Experimental Evaluation
- Compare CACTUS with STIRR [GKR98]
- Synthetic datasets
  - Quasi-random data [GKR98:STIRR]
    - Fix domain of each attribute
    - Randomly generate tuples from these domains
    - Identify clusters and plant additional (5%) data within the clusters
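The generation procedure can be sketched as follows. This assumes that "additional (5%)" means extra tuples amounting to 5% of the random tuples, drawn uniformly from the planted clusters; the function name, parameters, and seed are illustrative and do not describe the authors' actual generator.

```python
import random

def make_dataset(n_tuples, n_attrs, domain_size, clusters, extra_frac=0.05, seed=0):
    """Uniform random tuples plus extra_frac * n_tuples tuples planted in clusters.
    Each cluster is a list of value ranges, one per attribute."""
    rng = random.Random(seed)
    data = [tuple(rng.randrange(domain_size) for _ in range(n_attrs))
            for _ in range(n_tuples)]
    n_extra = int(n_tuples * extra_frac)
    for _ in range(n_extra):
        cluster = rng.choice(clusters)
        data.append(tuple(rng.choice(values) for values in cluster))
    return data

# Two planted clusters over two attributes with domain {0, ..., 99}.
clusters = [[range(0, 10), range(0, 10)], [range(10, 20), range(10, 20)]]
data = make_dataset(1000, 2, 100, clusters)
```

The planted tuples raise the support of value pairs inside each cluster above what the uniform background produces.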
Synthetic Datasets
- Planted clusters: {0,…,9} x {0,…,9} and {10,…,19} x {10,…,19}
- Both CACTUS and STIRR identified the two clusters exactly
[Figure: the two planted clusters, with attribute values running from 0 to 99]
Synthetic Dataset (contd.)
- Planted clusters:
  - {0,…,9} x {0,…,9} x {0,…,9}
  - {10,…,19} x {10,…,19} x {10,…,19}
  - {0,…,9} x {10,…,19} x {10,…,19}
- CACTUS identifies the 3 clusters
- STIRR returns:
  - {0,…,9} x {0,…,19} x {0,…,9}
  - {10,…,19} x {0,…,19} x {10,…,19}
[Figure: the planted clusters, with attribute values running from 0 to 99]
Scalability with #Tuples
[Chart: Time (in seconds) vs. #Tuples (in millions, 1 to 5) for CACTUS and STIRR; #Attributes: 10, Domain Size: 100]
CACTUS is 10 times faster
Scalability with #Attributes
[Chart: Time (in seconds) vs. #Attributes (4 to 50) for CACTUS and STIRR; 1 million tuples, Domain size: 100]
Scalability with Domain Size
[Chart: Time (in seconds) vs. #Attribute Values (50 to 1000) for CACTUS and STIRR; 1 million tuples, #attributes: 4]
Bibliographic Data
- Database and theory bibliographic entries [Wie]: 38500 entries
- Attributes: first author, second author, conference/journal, and year
- Example cluster projections on the conference attribute:
  (1) ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
  (2) ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
  (3) PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
ROCK (Guha, Rastogi, Shim, ICDE 99)
Each tuple is a node, and two nodes are linked if within a threshold distance.
Similarity between two nodes is the number of common neighbors.
ROCK does agglomerative hierarchical clustering based on similarity.
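The neighbor and link computation described above can be sketched as follows. The sketch assumes tuples are represented as sets of items, that "within a threshold distance" means Jaccard similarity of at least a hypothetical parameter `theta`, and that a node is not its own neighbor; it covers only link counting, not the agglomerative merging step.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def rock_links(items, theta):
    """items: list of non-empty sets. Returns {(i, j): number of common neighbors}."""
    n = len(items)
    # Two nodes are neighbors when their Jaccard similarity reaches theta.
    neighbors = [{j for j in range(n)
                  if j != i and jaccard(items[i], items[j]) >= theta}
                 for i in range(n)]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}

# Hypothetical baskets: the first three overlap pairwise, the fourth is unrelated.
baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
links = rock_links(baskets, theta=0.5)
```

Each pair among the first three baskets shares exactly one common neighbor (the third basket of the trio), while the unrelated basket has no links to any of them.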
Research Directions: The EDAM Project
Example Tasks
- Label a spectrum to identify elements
- Find common elements across (subsets of) spectra
  - Collected at multiple locations, and multiple conditions, and …
  - At different times, and over time periods
- Find subsets of spectra (e.g., based on time periods and locations) with
  - Unusually common elements
  - Interesting characteristics
  - Correlations to other spectral streams
- Want to be able to reconstruct analysis done a year ago and run it on different data
- Want to share ongoing analysis with colleagues and track changes and their impact
[Slides omitted from this version]
Conclusions
- Database systems hold a lot of the data people care about and want to mine, making them an important part of the mining environment
  - Especially for ongoing analysis and collaboration
- Beyond this, there are a number of ideas and techniques in the DB literature that can be applied more broadly
  - Formulations of mining tasks
  - Algorithms
- Scalability is an important idea from databases
  - But there are many more: compositionality, query-driven approach, set-oriented analyses