Online Learning with Streamdrill

Embed Size (px)

Citation preview

  • 7/28/2019 Online Learning with Streamdrill

    1/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Online Learning WithStreamdrill

    Mikio L. Braun, @mikiobraunhttp://blog.mikiobraun.de

    TWIMPACT

    http://twimpact.com

    http://blog.mikiobraun.de/http://twimpact.com/http://twimpact.com/http://blog.mikiobraun.de/
  • 7/28/2019 Online Learning with Streamdrill

    2/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Event Data

    Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

    FinanceFinance GamingGaming MonitoringMonitoring

    Social MediaSocial Media

    Sensor NetworksSensor NetworksAdvertismentAdvertisment

  • 7/28/2019 Online Learning with Streamdrill

    3/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Event Data is Huge: Volume

    The problem: You easily get A LOT OF DATA!

    100 events per second

    360k events per hour

    8.6M events per day

    260M events per month

    3.2B events per year

  • 7/28/2019 Online Learning with Streamdrill

    4/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Event Data is Huge: Diversity

    Potentially large spaces:

    distinct words: >100k

    IP addresses: >100M

    users in a social network: >10M

    http://www.flickr.com/photos/arenamontanus/269158554/http://wordle.net

  • 7/28/2019 Online Learning with Streamdrill

    5/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Recommenders

  • 7/28/2019 Online Learning with Streamdrill

    6/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Stream Mining to the rescue

    Stream mining algorithms:

    answer stream queries with finite resources

    Typical examples:

    how often does an item appear in a stream?

    how many distinct elementsare in the stream?

    what are the top-k most frequent items?

    Continuous Stream of Data

    Bounded ResourceAnalyzer Stream Queries

  • 7/28/2019 Online Learning with Streamdrill

    7/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    The Trade-Off

    ExactFast

    Big Data

    Stream MiningMap Reduceand friends

    First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra

    http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandrahttp://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra
  • 7/28/2019 Online Learning with Streamdrill

    8/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Heavy Hitters (a.k.a. Top-k)

    Count activities over large item sets (millions,even more, e.g. IP addresses, Twitter users)

    Interested in most active elements only.

    Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conferenceon Database Theory, 2005

    frank

    paul

    jan

    leo

    conrad

    stephan

    15

    12

    8

    5

    32

    Fixed tables of counts

    Case 1: element already in data base

    Case 2: new element

    paul paul 12 13

    hugo stephan 2

    hugo 3

  • 7/28/2019 Online Learning with Streamdrill

    9/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Heavy Hitters over Time-Window

    Time

    DB

    Keep quite a big log (amonth?)

    Constant write/erase indatabase

    Alternative: Exponentialdecay

  • 7/28/2019 Online Learning with Streamdrill

    10/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Exponential Decay

    Instead of a fixed window, use exponential

    decay

    The beauty: updates are recursive

    halftime

    score

    timestamp

    time shift term

  • 7/28/2019 Online Learning with Streamdrill

    11/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Count-Min Sketches

    Summarize histograms over large feature sets

    Like bloom filters, but better

    Query: Take minimum over all hash functions

    0 0 3 01 1 0 2

    0 2 0 00 3 5 2

    0 5 3 2

    2 4 5 0

    1 3 7 3

    0 2 0 8

    m bins

    n differenthash functions

    Updates for new entryQuery result: 1

    G. Cormode and S. Muthukrishnan.An improved data stream summary: The count-min sketch and its applications.LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .

  • 7/28/2019 Online Learning with Streamdrill

    12/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    What you can do with StreamMining

    Count things:

    user interactions

    items purchased

    pages viewed

    songs played

    Trends:

    most active pages

    most active users

  • 7/28/2019 Online Learning with Streamdrill

    13/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    But there is more:Counting is Statistics

    Empirical mean:

    Correlations:

    Principal Component Analysis:

  • 7/28/2019 Online Learning with Streamdrill

    14/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Profiling User Behavior

  • 7/28/2019 Online Learning with Streamdrill

    15/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Statistics and Outlier Detection

    When does behavior change?

  • 7/28/2019 Online Learning with Streamdrill

    16/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Maximum-Likelihood

    Estimate probabilistic models

    But wait, how do I 1/n with randomly spaced

    events?

    based on

    which is slightly

    biased, butsimpler

  • 7/28/2019 Online Learning with Streamdrill

    17/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Outlier detection

    Once you have a model, you can computep-values (based on recent time frames!)

  • 7/28/2019 Online Learning with Streamdrill

    18/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Online TF-IDF

  • 7/28/2019 Online Learning with Streamdrill

    19/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Online TF-IDF

    estimate word document frequencies

    for each word: update(word, t, 1.0)

    for each document: update(#docs, t, 1.0)

    query: score(word) / score(#docs)

  • 7/28/2019 Online Learning with Streamdrill

    20/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Streamdrill

    Heavy Hitters counting + exponential decay

    Instant counts & top-k results over timewindows.

    Indices!

    Snapshots for historical analysis

    Beta demo available at http://streamdrill.com,

    launch imminent

    hi i

  • 7/28/2019 Online Learning with Streamdrill

    21/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Architecture Overview

    http://streamdrill.com/
  • 7/28/2019 Online Learning with Streamdrill

    22/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Example: Twitter Stock Analysis

    http://play.streamdrill.com/vis/

  • 7/28/2019 Online Learning with Streamdrill

    23/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Example: Twitter Stock Analysis

    http://play.streamdrill.com/vis/
  • 7/28/2019 Online Learning with Streamdrill

    24/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Example: Twitter Stock Analysis

  • 7/28/2019 Online Learning with Streamdrill

    25/25

    Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

    Summary

    Doesn't always have to be scaling!

    Stream mining: Approximate results withfinite resources.

    streamdrill: stream analysis engine