Upload
mikio-braun
View
215
Download
0
Embed Size (px)
Citation preview
7/28/2019 Online Learning with Streamdrill
1/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Online Learning WithStreamdrill
Mikio L. Braun, @mikiobraunhttp://blog.mikiobraun.de
TWIMPACT
http://twimpact.com
http://blog.mikiobraun.de/http://twimpact.com/http://twimpact.com/http://blog.mikiobraun.de/7/28/2019 Online Learning with Streamdrill
2/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Event Data
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
FinanceFinance GamingGaming MonitoringMonitoring
Social MediaSocial Media
Sensor NetworksSensor NetworksAdvertismentAdvertisment
7/28/2019 Online Learning with Streamdrill
3/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Event Data is Huge: Volume
The problem: You easily get A LOT OF DATA!
100 events per second
360k events per hour
8.6M events per day
260M events per month
3.2B events per year
7/28/2019 Online Learning with Streamdrill
4/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Event Data is Huge: Diversity
Potentially large spaces:
distinct words: >100k
IP addresses: >100M
users in a social network: >10M
http://www.flickr.com/photos/arenamontanus/269158554/http://wordle.net
7/28/2019 Online Learning with Streamdrill
5/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Recommenders
7/28/2019 Online Learning with Streamdrill
6/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Stream Mining to the rescue
Stream mining algorithms:
answer stream queries with finite resources
Typical examples:
how often does an item appear in a stream?
how many distinct elementsare in the stream?
what are the top-k most frequent items?
Continuous Stream of Data
Bounded ResourceAnalyzer Stream Queries
7/28/2019 Online Learning with Streamdrill
7/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
The Trade-Off
ExactFast
Big Data
Stream MiningMap Reduceand friends
First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandrahttp://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra7/28/2019 Online Learning with Streamdrill
8/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Heavy Hitters (a.k.a. Top-k)
Count activities over large item sets (millions,even more, e.g. IP addresses, Twitter users)
Interested in most active elements only.
Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conferenceon Database Theory, 2005
frank
paul
jan
leo
conrad
stephan
15
12
8
5
32
Fixed tables of counts
Case 1: element already in data base
Case 2: new element
paul paul 12 13
hugo stephan 2
hugo 3
7/28/2019 Online Learning with Streamdrill
9/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Heavy Hitters over Time-Window
Time
DB
Keep quite a big log (amonth?)
Constant write/erase indatabase
Alternative: Exponentialdecay
7/28/2019 Online Learning with Streamdrill
10/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Exponential Decay
Instead of a fixed window, use exponential
decay
The beauty: updates are recursive
halftime
score
timestamp
time shift term
7/28/2019 Online Learning with Streamdrill
11/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Count-Min Sketches
Summarize histograms over large feature sets
Like bloom filters, but better
Query: Take minimum over all hash functions
0 0 3 01 1 0 2
0 2 0 00 3 5 2
0 5 3 2
2 4 5 0
1 3 7 3
0 2 0 8
m bins
n differenthash functions
Updates for new entryQuery result: 1
G. Cormode and S. Muthukrishnan.An improved data stream summary: The count-min sketch and its applications.LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .
7/28/2019 Online Learning with Streamdrill
12/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
What you can do with StreamMining
Count things:
user interactions
items purchased
pages viewed
songs played
Trends:
most active pages
most active users
7/28/2019 Online Learning with Streamdrill
13/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
But there is more:Counting is Statistics
Empirical mean:
Correlations:
Principal Component Analysis:
7/28/2019 Online Learning with Streamdrill
14/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Profiling User Behavior
7/28/2019 Online Learning with Streamdrill
15/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Statistics and Outlier Detection
When does behavior change?
7/28/2019 Online Learning with Streamdrill
16/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Maximum-Likelihood
Estimate probabilistic models
But wait, how do I 1/n with randomly spaced
events?
based on
which is slightly
biased, butsimpler
7/28/2019 Online Learning with Streamdrill
17/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Outlier detection
Once you have a model, you can computep-values (based on recent time frames!)
7/28/2019 Online Learning with Streamdrill
18/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Online TF-IDF
7/28/2019 Online Learning with Streamdrill
19/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Online TF-IDF
estimate word document frequencies
for each word: update(word, t, 1.0)
for each document: update(#docs, t, 1.0)
query: score(word) / score(#docs)
7/28/2019 Online Learning with Streamdrill
20/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Streamdrill
Heavy Hitters counting + exponential decay
Instant counts & top-k results over timewindows.
Indices!
Snapshots for historical analysis
Beta demo available at http://streamdrill.com,
launch imminent
hi i
7/28/2019 Online Learning with Streamdrill
21/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Architecture Overview
http://streamdrill.com/7/28/2019 Online Learning with Streamdrill
22/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/
7/28/2019 Online Learning with Streamdrill
23/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/7/28/2019 Online Learning with Streamdrill
24/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Example: Twitter Stock Analysis
7/28/2019 Online Learning with Streamdrill
25/25
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun
Summary
Doesn't always have to be scaling!
Stream mining: Approximate results withfinite resources.
streamdrill: stream analysis engine