A General Framework for Mining Massive Data Streams

Geoff Hulten, advised by Pedro Domingos




Mining Massive Data Streams

• High-speed data streams abundant
  – Large retailers
  – Long distance & cellular phone call records
  – Scientific projects
  – Large Web sites

• Build model of the process creating data

• Use model to interact more efficiently


Growing Mismatch Between Algorithms and Data

• State-of-the-art data mining algorithms
  – One-shot learning
  – Work with static databases
  – Maximum of 1 million – 10 million records

• Properties of data streams
  – Data stream exists over months or years
  – 10s – 100s of millions of new records per day
  – Process generating the data changing over time


The Cost of This Mismatch

• Fraction of data we can effectively mine shrinking towards zero

• Models learned from heuristically selected samples of data

• Models out of date before being deployed


Need New Algorithms

• Monitor a data stream and have a model available at all times

• Improve the model as data arrives

• Adapt the model as process generating data changes

• Have quality guarantees

• Work within strict resource constraints


Solution: General Framework

• Applicable to algorithms based on discrete search

• Semi-automatically converts algorithm to meet our design needs

• Uses sampling to select data size for each search step

• Extensions to continuous searches and relational data


Outline

• Introduction

• Scaling up Decision Trees

• Our Framework for Scaling

• Other Applications and Results

• Conclusion


Decision Trees

• Examples: $(x_1, \ldots, x_D, y)$

• Encode: $y = F(x_1, \ldots, x_D)$

• Nodes contain tests

• Leaves contain predictions

[Figure: Gender? splits Male → False, Female → Age?; Age? splits < 25 → False, >= 25 → True]


Decision Tree Induction

DecisionTree(Data D, Tree T, Attributes A)
  If D is pure
    Let T be a leaf predicting the class in D
    Return
  Let X be the best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X
    Let D_V be the portion of D with value V for X
    Let T_V be the child of T for V
    DecisionTree(D_V, T_V, A – X)
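The pseudocode above can be fleshed out into a minimal runnable sketch; the `entropy`/`gain` helpers and the (features-dict, label) example layout are illustrative choices, not the talk's actual implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_i p_i lg p_i over the class distribution of `labels`."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(data, attr):
    """Entropy reduction G() from splitting `data` on `attr`;
    each example is a (features_dict, label) pair."""
    labels = [y for _, y in data]
    after = 0.0
    for v in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == v]
        after += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - after

def decision_tree(data, attrs):
    """Recursive induction: returns a predicted label (a leaf)
    or a (split_attr, {value: subtree}) node."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attrs:          # D is pure, or no tests left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(data, a))  # X = best of A by G()
    return (best, {v: decision_tree([(x, y) for x, y in data if x[best] == v],
                                    attrs - {best})
                   for v in {x[best] for x, _ in data}})
```

Note this one-shot recursion needs all of `data` in memory at every step, which is exactly the property VFDT removes.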


VFDT (Very Fast Decision Tree)

• To pick the split attribute for a node, looking at a few examples may be sufficient

• Given a stream of examples:
  – Use the first to pick the split at the root
  – Sort succeeding ones to the leaves
  – Pick the best attribute there
  – Continue…

• Leaves predict most common class
• Very fast, incremental, anytime decision tree induction algorithm


How Much Data?

• Make sure the best attribute is better than the second best
  – That is: $G(X_1) - G(X_2) > 0$

• Using a sample, so we need the Hoeffding bound
  – Collect data until: $G(X_1) - G(X_2) > \epsilon$, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
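The bound on this slide is direct to compute; a small helper, with illustrative names (R is the range of G, e.g. lg of the number of classes for information gain):

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range R lies within epsilon of the average of n
    independent observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def enough_data(g_best, g_second, R, delta, n):
    """Split when the observed gap G(X1) - G(X2) exceeds epsilon."""
    return g_best - g_second > hoeffding_epsilon(R, delta, n)
```

The key property: epsilon shrinks like 1/sqrt(n), so a clear winner needs few examples while a close race automatically accumulates more data.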


Core VFDT Algorithm

Procedure VFDT(Stream, δ)
  Let T = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream
    Sort (X, y) to a leaf using T
    Update sufficient statistics at the leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split leaf on best attribute
      For each branch
        Start a new leaf, initialize sufficient statistics
  Return T

[Figure: example tree; x1? (male → y=0, female → x2?), x2? (> 65 → y=0, <= 65 → y=1)]
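A minimal runnable sketch of this loop, assuming discrete attributes and omitting the memory management, ties, and pre-pruning covered later; all class and function names are illustrative, not the talk's implementation:

```python
import math
from collections import defaultdict

class Leaf:
    """While a leaf: accumulates sufficient statistics
    counts[attr][value][class]; after splitting: holds children."""
    def __init__(self, attrs):
        self.attrs = set(attrs)
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in attrs}
        self.class_counts = defaultdict(int)
        self.split_attr = None
        self.children = None  # None while still a leaf

    def update(self, x, y):
        self.class_counts[y] += 1
        for a in self.attrs:
            self.counts[a][x[a]][y] += 1

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def gain(leaf, a):
    """Entropy reduction G for splitting this leaf on attribute a."""
    n = sum(leaf.class_counts.values())
    after = sum(sum(cls.values()) / n * entropy(cls)
                for cls in leaf.counts[a].values())
    return entropy(leaf.class_counts) - after

def vfdt(stream, attrs, delta=1e-7, R=1.0):
    root = Leaf(attrs)
    for x, y in stream:
        leaf = root
        while leaf.children is not None:          # sort (X, y) to a leaf
            v = x[leaf.split_attr]
            if v not in leaf.children:
                leaf.children[v] = Leaf(leaf.attrs - {leaf.split_attr})
            leaf = leaf.children[v]
        leaf.update(x, y)                         # update sufficient statistics
        if len(leaf.attrs) < 2:
            continue
        gains = sorted((gain(leaf, a) for a in leaf.attrs), reverse=True)
        n = sum(leaf.class_counts.values())
        eps = math.sqrt(R * R * math.log(1 / delta) / (2 * n))
        if gains[0] - gains[1] > eps:             # Hoeffding test: split now
            best = max(leaf.attrs, key=lambda a: gain(leaf, a))
            leaf.split_attr = best
            leaf.children = {v: Leaf(leaf.attrs - {best})
                             for v in leaf.counts[best]}
    return root
```

Each example is used to update one leaf and then discarded, so memory depends on the tree size, not the stream length.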


Quality of Trees from VFDT

• Model may contain incorrect splits; is it still useful?

• Bound the difference with the infinite-data tree
  – The chance that an arbitrary example takes a different path

• Intuition: an example on level i of the tree has i chances to go through a mistaken node

$$\Delta(HT_\delta, DT_\infty) \le \frac{\delta}{p}$$

(p: the probability that an example reaches a leaf)


Complete VFDT System

• Memory management
  – Memory dominated by sufficient statistics
  – Deactivate less promising leaves when needed

• Ties
  – Wasteful to decide between identical attributes

• Check for splits periodically

• Pre-pruning
  – Only make splits that improve the value of G(.)

• Early stop on bad attributes


VFDT (Continued)

• Bootstrap with traditional learner

• Rescan dataset when time available

• Time changing data streams

• Post pruning

• Continuous attributes

• Batch mode


Experiments

• Compared VFDT and C4.5 (Quinlan, 1993)

• Same memory limit for both (40 MB)
  – 100k examples for C4.5

• VFDT settings: δ = 10^-7, τ = 5%

• Domains: 2 classes, 100 binary attributes

• Fifteen synthetic trees with 2.2k – 500k leaves

• Noise from 0% to 30%

[Slides 17–19: figures]

Running Times

• Pentium III at 500 MHz running Linux

• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds

• VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process

• VFDT processes 32k examples per second (excluding I/O)

[Slides 20–21: figures]

Real World Data Sets: Trace of UW Web Requests

• Stream of Web page requests from UW
• One week: 23k clients, 170 orgs., 244k hosts, 82.8M requests (peak: 17k/min), 20 GB
• Goal: improve cache by predicting requests
• 1.6M examples, 61% default class
• C4.5 on 75k examples, 2975 secs.
  – 73.3% accuracy
• VFDT ~3000 secs., 74.3% accurate


Outline

• Introduction

• Scaling up Decision Trees

• Our Framework for Scaling

• Overview of Applications and Results

• Conclusion


Data Mining as Discrete Search

[Figure: search-space diagram]

• Initial state
  – Empty, prior, or random

• Search operators
  – Refine structure

• Evaluation function
  – Likelihood, many others

• Goal state
  – Local optimum, etc.


Data Mining As Search

[Figure: candidate search states, each scored against the training data (scores 1.5 – 2.0)]


Example: Decision Tree

• Initial state
  – Root node

• Search operators
  – Turn any leaf into a test on an attribute

• Evaluation
  – Entropy reduction: $-\sum_{i=1}^{val(y)} p_i \lg p_i$

• Goal state
  – No further gain
  – Post-prune

[Figure: candidate trees splitting on X1? through Xd?, scored against the training data]


Overview of Framework

• Cast the learning algorithm as a search

• Begin monitoring the data stream
  – Use each example to update sufficient statistics where appropriate (then discard it)
  – Periodically pause and use statistical tests

• Take steps that can be made with high confidence
  – Monitor old search decisions
    • Change them when the data stream changes


How Much Data is Enough?

[Figure: candidate states X1? and Xd? scored on the full training data (1.65 vs. 1.38)]


How Much Data is Enough?

[Figure: the same candidate states scored on a sample of the data: 1.6 ± ε vs. 1.4 ± ε]

• Use statistical bounds
  – Normal distribution
  – Hoeffding bound

• Applies to scores that are averages over examples

• Can select a winner if
  – Score$_1$ > Score$_2$ + ε, with $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$


Global Quality Guarantee

• δ – probability of error in single decision

• b – branching factor of search

• d – depth of search

• c – number of checks for winner

δ* = δbdc (by the union bound over all decisions)


Identical States And Ties

• Fails if states are identical (or nearly so)

• τ – user supplied tie parameter

• Select winner early if the alternatives differ by less than τ
  – Score1 > Score2 + ε, or ε <= τ
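The combined rule can be sketched directly; the helper and its argument names are illustrative (Score1/Score2 are the top two candidate scores):

```python
import math

def select_winner(score1, score2, R, delta, n, tau):
    """Return True if we can act now: either the leader beats the
    runner-up by more than epsilon, or epsilon has shrunk below the
    tie threshold tau (the candidates are effectively identical)."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return score1 - score2 > eps or eps <= tau
```

Without the τ clause, two equally good attributes would force the algorithm to collect data forever.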


Dealing with Time Changing Concepts

• Maintain a window of the most recent examples
• Keep the model up to date with this window
• Effective when window size is similar to concept drift rate
• Traditional approach
  – Periodically reapply the learner
  – Very inefficient!
• Our approach
  – Monitor quality of old decisions as the window shifts
  – Correct decisions in a fine-grained manner
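The fine-grained idea rests on sufficient statistics that can be decremented as well as incremented, so old decisions can be re-checked against the current window without relearning. A tiny illustrative sketch tracking only class counts (not the talk's CVFDT implementation):

```python
from collections import deque, defaultdict

class WindowedStats:
    """Sufficient statistics over a sliding window: each arriving
    example is added, and once the window is full the oldest example's
    contribution is subtracted again."""
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.class_counts = defaultdict(int)

    def add(self, y):
        self.window.append(y)
        self.class_counts[y] += 1
        if len(self.window) > self.window_size:   # retire oldest example
            old = self.window.popleft()
            self.class_counts[old] -= 1
```

Because the counts always reflect exactly the current window, a split that was justified by old data can be detected as no longer winning and corrected in place.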


Alternate Searches

• When a new test looks better, grow an alternate sub-tree
• Replace the old one when the new one is more accurate
• This smoothly adjusts to changing concepts

[Figure: tree rooted at Gender?, with an alternate sub-tree (Pets?, College?, Hair?) grown alongside the original]


RAM Limitations

• Each search requires a sufficient statistics structure

• Decision tree
  – O(avc) RAM

• Bayesian network
  – O(c^p) RAM
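The O(avc) figure invites a back-of-the-envelope estimate; this helper is purely illustrative (a = attributes, v = values per attribute, c = classes, with an assumed counter width):

```python
def dt_stats_bytes(leaves, a, v, c, bytes_per_count=4):
    """Rough RAM estimate for decision-tree sufficient statistics:
    each active leaf keeps one count per (attribute, value, class)."""
    return leaves * a * v * c * bytes_per_count
```

For the experiments' setup (100 binary attributes, 2 classes), a thousand active leaves already need on the order of megabytes, which is why less promising leaves get deactivated.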


RAM Limitations

[Figure: leaves marked active vs. temporarily inactive]


Outline

• Introduction

• Data Mining as Discrete Search

• Our Framework for Scaling

• Application to Decision Trees

• Other Applications and Results

• Conclusion


Applications

• VFDT (KDD ’00) – decision trees
• CVFDT (KDD ’01) – VFDT + concept drift
• VFBN & VFBN2 (KDD ’02) – Bayesian networks
• Continuous searches
  – VFKM (ICML ’01) – k-means clustering
  – VFEM (NIPS ’01) – EM for mixtures of Gaussians
• Relational data sets
  – VFREL (submitted) – feature selection in relational data


CVFDT Experiments


Activity Profile for VFBN


Other Real World Data Sets

• Trace of all web requests from UW campus
  – Use clustering to find good locations for proxy caches

• KDD Cup 2000 data set
  – 700k page requests from an e-commerce site
  – Categorize pages into 65 categories; predict which a session will visit

• UW CSE data set
  – 8 million sessions over two years
  – Predict which of 80 level-2 directories each visits

• Web crawl of .edu sites
  – Two data sets, each with two million web pages
  – Use relational structure to predict which will increase in popularity over time


Related Work

• DB Mine: A Performance Perspective (Agrawal, Imielinski & Swami ’93)
  – Framework for scaling rule learning

• RainForest (Gehrke, Ramakrishnan & Ganti ’98)
  – Framework for scaling decision trees

• ADtrees (Moore & Lee ’97)
  – Accelerate computing sufficient statistics

• PALO (Greiner ’92)
  – Accelerate hill-climbing search via sampling

• DEMON (Ganti, Gehrke & Ramakrishnan ’00)
  – Framework for converting incremental algorithms for time-changing data streams


Future Work

• Combine framework for discrete search with frameworks for continuous search and relational learning

• Further study time-changing processes

• Develop a language for specifying data stream learning algorithms

• Use framework to develop novel algorithms for massive data streams

• Apply algorithms to more real-world problems


Conclusion

• Framework helps scale up learning algorithms based on discrete search

• Resulting algorithms:
  – Work on databases and data streams
  – Work with limited resources
  – Adapt to time-changing concepts
  – Learn in time proportional to concept complexity, independent of the amount of training data!

• Benefits have been demonstrated in a series of applications