Upload
mikio-braun
View
215
Download
0
Embed Size (px)
Citation preview
7/30/2019 Online Learning with Stream Mining
1/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Online Learning withStream Mining
Mikio L. Braun, @mikiobraunhttp://blog.mikiobraun.de
TWIMPACT
http://twimpact.com
http://blog.mikiobraun.de/http://twimpact.com/http://twimpact.com/http://blog.mikiobraun.de/7/30/2019 Online Learning with Stream Mining
2/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Event Data
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
FinanceFinance GamingGaming MonitoringMonitoring
Social MediaSocial Media
Sensor NetworksSensor Networks Advertisment Advertisment
7/30/2019 Online Learning with Stream Mining
3/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Online Learning
Isn't all learning online?
7/30/2019 Online Learning with Stream Mining
4/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Isn't Machine Learning easilyOnline?
Stochastic gradient descent
converges, e.g. if
http://leon.bottou.org/research/stochastic
http://leon.bottou.org/research/stochastichttp://leon.bottou.org/research/stochastic7/30/2019 Online Learning with Stream Mining
5/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Online vs Batch: Non-stationarities
Very good tutorial by Albert Bifet et al. on these issues at http://sites.google.com/site/advancedstreamingtutorial
http://sites.google.com/site/advancedstreamingtutorialhttp://sites.google.com/site/advancedstreamingtutorial7/30/2019 Online Learning with Stream Mining
6/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Time horizons vs. Learning rate
You can't just do online learning on event data!
7/30/2019 Online Learning with Stream Mining
7/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Also, Event Data is huge The problem: You easily get A LOT OF DATA!
100 events per second 360k events per hour 8.6M events per day 260M events per month 3.2B events per year
7/30/2019 Online Learning with Stream Mining
8/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
So, online learning challenges:
So much data!
Concept Drift Online (as in not batch)
is not the whole story.
7/30/2019 Online Learning with Stream Mining
9/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Digging into Least Squares
d
d
with entries
this could be huge!
d x d is probably ok
Idea: Batch method like least squares on recent portionof the data.
But: It's just a sum!
7/30/2019 Online Learning with Stream Mining
10/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Another Problem: High-dimensionalSpaces
Potentially large spaces: distinct words: >100k IP addresses: >100M users in a social network: >10M
http://www.flickr.com/photos/arenamontanus/269158554/http://wordle.net
7/30/2019 Online Learning with Stream Mining
11/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Stream Mining to the rescue Stream mining algorithms:
answer stream queries with finite resources Typical examples:
how often does an item appear in a stream? how many distinct elements are in the stream? what are the top-k most frequent items?
Continuous Stream of Data
Bounded Resource Analyzer Stream Queries
7/30/2019 Online Learning with Stream Mining
12/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
The Trade-Off
ExactFast
Big Data
Stream Mining Map Reduceand friends
First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandrahttp://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra7/30/2019 Online Learning with Stream Mining
13/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Heavy Hitters (a.k.a. Top-k) Count activities over large item sets (millions,
even more, e.g. IP addresses, Twitter users) Interested in most active elements only.
Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conferenceon Database Theory, 2005
132
142
432
553
712023
15
12
85
32
Fixed tables of counts
Case 1: element already in data base
Case 2: new element
142 142 12 13
713 023 2
713 3
7/30/2019 Online Learning with Stream Mining
14/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Count-Min Sketches Summarize histograms over large feature sets Like bloom filters, but better
Query: Take minimum over all hash functions
0 0 3 0
1 1 0 2
0 2 0 0
0 3 5 20 5 3 2
2 4 5 01 3 7 3
0 2 0 8
m bins
n differenthash functions
Updates for new entryQuery result: 1
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications.LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .
7/30/2019 Online Learning with Stream Mining
15/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Clustering with count-min Sketches
Online clustering For each data point:
Map to closest centroid ( compute distances) Update centroid
count-min sketches to represent sum over allvectors in a class
Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009
0 0 3 01 1 0 2 0 2 0 00 3 5 20 5 3 22 4 5 0
1 3 7 30 2 0 8
7/30/2019 Online Learning with Stream Mining
16/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Heavy Hitters over Time-Window
Time
DB
Keep quite a big log (amonth?)
Constant write/erase indatabase
Alternative: Exponentialdecay
7/30/2019 Online Learning with Stream Mining
17/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Exponential Decay
Instead of a fixed window, use exponentialdecay
The beauty: updates are recursive
halftime
score
timestamp
time shift term
7/30/2019 Online Learning with Stream Mining
18/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Exponential Decay Collect stats by a table of expdecay counters
counters[item] # countersts[item] # last timestamp
update(C, item, timestamp, count) update countsC.counters[item] = count +
weight(timestamp, C.ts[item]) * C.counters[item]C.ts[item] = timestampC.lastupdate = timestamp
score(C, item) return scorereturn weight(C.lastupdate, C.ts[item])
* C.counters[item]
7/30/2019 Online Learning with Stream Mining
19/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Least Squares Revisited Need to compute For each do
Then, reconstruct
As a reminder:
7/30/2019 Online Learning with Stream Mining
20/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
More: Maximum-Likelihood Estimate probabilistic models
But wait, how do I 1/n with randomly spacedevents?
based on
which is slightlybiased, butsimpler
7/30/2019 Online Learning with Stream Mining
21/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Outlier detection Once you have a model, you can compute
p-values (based on recent time frames!)
7/30/2019 Online Learning with Stream Mining
22/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
TF-IDF estimate word document frequencies
for each word: update(word, t, 1.0) for each document: update(#docs, t, 1.0)
query: score(word) / score(#docs)
7/30/2019 Online Learning with Stream Mining
23/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Extracting a relevant subset
7/30/2019 Online Learning with Stream Mining
24/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Classification with Nave Bayes Naive Bayes is also just counting, right?
class priors
Multinomnial nave Bayes
Priors
Total number of words in class
Number of times word appears in classfrequency of word in document
7/30/2019 Online Learning with Stream Mining
25/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Classification with Naive Bayes
ICML 2003
7/30/2019 Online Learning with Stream Mining
26/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Classification with Naive Bayes 7 Steps to improve NB:
transform TF to log( . + 1) IDF-style normalization square length normalization use complement probability another log
normalize those weights again Predict linearly using those weights
7/30/2019 Online Learning with Stream Mining
27/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
What about non-parametricmethods and Kernel Methods?
Problem here, no real accumulation of information instatistics, e.g. SVMs
Could still use streamdrill to extract a representativesubset.
sum over all elements!
7/30/2019 Online Learning with Stream Mining
28/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Streamdrill Heavy Hitters counting + exponential decay Instant counts & top-k results over time
windows. Indices! Snapshots for historical analysis Beta demo available at http://streamdrill.com ,
launch imminent
http://streamdrill.com/http://streamdrill.com/7/30/2019 Online Learning with Stream Mining
29/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Architecture Overview
7/30/2019 Online Learning with Stream Mining
30/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
REST API Create a trend
/1/create/plays/user:song:location?size=1000?timescales=day,hour
Update a word/1/update/plays/frank:123123:San+Francisco
Another word (with timestamp)/1/update/plays/paul:145323:Berlin?ts=131341354135
Get most played songs for SF or Paul/1/query/plays?city=San+Francisco/1/query/plays?user=paul
Get score for a word/1/query/score/hello
7/30/2019 Online Learning with Stream Mining
31/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/
http://play.streamdrill.com/vis/http://play.streamdrill.com/vis/7/30/2019 Online Learning with Stream Mining
32/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Example: Twitter Stock Analysis Trends:
symbol:combinations $AAPL:$GOOG symbol:hashtag $AAPL:#trading symbol:keywords $GOOG:disruption symbol:mentions $GOOG:WallStreetCom symbol trend $AAPL
symbol:url $FB:http://on.wsj.com/15fHaZW
7/30/2019 Online Learning with Stream Mining
33/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Example: Twitter Stock Analysis
7/30/2019 Online Learning with Stream Mining
34/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Example: Twitter Stock Analysis
7/30/2019 Online Learning with Stream Mining
35/36
Machine Learning Meetup San Francisco, April 24, 2013 (c) 2013 by TWIMPACT
Example: Twitter Stock Analysis
stream drill
JavaScriptvia REST
Tweet Analyzer
tweets
updates
7/30/2019 Online Learning with Stream Mining
36/36
M hi L i M t S F i A il 24 2013 ( ) 2013 b TWIMPACT
Summary Doesn't always have to be scaling! Stream mining : Approximate results with
finite resources. stream drill: stream analysis engine