Early Detection of Trending Topics Using Twitter
D.C. NLP Meetup
June 10, 2015
Topics
• Motivation
• Underlying Theory
• Challenge
• Approach
• Initial Results
• Potential Implications
Timeline
• 9:31 AM – Explosion occurred
• +1 min – First tweet
• +20 min – Local news reported
Reference: https://gigaom.com/2014/03/16/how-twitter-confirmed-the-explosion-in-harlem-before-the-news-did/
Harlem Gas Explosion in NYC (March 2014)
Motivation: Find the ‘tweet-dle’ within the ‘tweet-stack’
[Diagram: Tweet → Me → ‘Interesting!’ → Action: retweet/tweet; Tweet → Me → ‘Meh’ → Action: no retweet/tweet]
Theory: It’s Not Why We Share But How We Share
[Figure: Tweet Rate Over Time Across Topics. Panels: Step-wise, Gradual, Quick Rise. X-axis: Time Before Tagged As Trending Topic (min)]
Implications
• Multiple ways topics can ‘trend’
• Approaches
  – Parametric: too many variations
  – Non-parametric: supports wide variations in a more automated fashion
Challenge: All Roads Lead to Rome!
[Figure: Tweet Rate Over Time Across Topics. Panels: Step-wise, Gradual, Quick Rise. X-axis: Time Before Tagged As Trending Topic (min)]
Clustering (used to classify new trends)
Approach: Time-Series Clustering
Data Collection
• Tweets (Streaming API)
• Topics (Trend API)
• Notes:
  – Streaming API: 1% of tweets
  – English only
  – 2-week sample (Jan ’15)
Feature Engineering
• Tweets: split trending vs. non-trending
• Topics: filter for topic of the day
• Tweet Normalization / Interpolation
• Topic Identification
  – Trending (#, unigrams)
  – Non-trending (#)
• Trending Topics
  – Exclude recurring or spurious
  – Include topic within 24 hrs
Modeling
• K-Means Clustering
• Distance metric
  – Use dynamic time warping to align time series
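As a rough illustration of the distance-metric step, the sketch below shows how dynamic time warping can align two tweet-rate series that have the same shape but rise at different times. It is a minimal sketch, not the code used in the talk: the function name `dtw_path` and the sample series are hypothetical, and the per-point cost is assumed to be the absolute difference (the one-dimensional Euclidean distance).

```python
import numpy as np

def dtw_path(a, b):
    """Return (total cost, alignment path) for two 1-D sequences.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed per-point cost
            cost[i, j] = d + min(cost[i - 1, j],      # a advances, b repeats
                                 cost[i, j - 1],      # b advances, a repeats
                                 cost[i - 1, j - 1])  # both advance
    # Backtrack from the end to recover which points were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return float(cost[n, m]), path

# Made-up series: the same bump, shifted by one minute.
early = [0, 0, 1, 2, 1, 0]
late = [0, 1, 2, 1, 0, 0]
total_cost, alignment = dtw_path(early, late)
```

For these two made-up series the warped cost is 0, whereas comparing them minute by minute would report a difference, which is the motivation for aligning before measuring distance.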
Approach: Data Pipeline
1. Normalization
• Time series plot based on tweet rate
• Fixed length (120 min)
• Tweet rate based on tweet 120 min ago
2. Linear Interpolation
• Due to Streaming API, 1% of tweets
• Gaps in the data
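The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the talk's implementation: it assumes per-minute tweet counts with gaps marked as NaN, and it reads "tweet rate based on tweet 120 min ago" as each value expressed as a percentage of the first value in the 120-minute window. The names `interpolate_gaps`, `normalize`, and `WINDOW_MIN` are hypothetical.

```python
import numpy as np

WINDOW_MIN = 120  # fixed series length (min), from the slide

def interpolate_gaps(counts):
    """Fill missing minutes (NaN) by linear interpolation between known points.

    Gaps at the very start or end take the nearest known value,
    which is how np.interp treats points outside the known range.
    """
    counts = np.asarray(counts, dtype=float)
    minutes = np.arange(len(counts))
    known = ~np.isnan(counts)
    return np.interp(minutes, minutes[known], counts[known])

def normalize(counts, window=WINDOW_MIN):
    """Keep the last `window` minutes, as a % of the first kept value.

    Assumption: the baseline is the count at the start of the window.
    """
    series = np.asarray(counts, dtype=float)[-window:]
    baseline = series[0]
    if baseline == 0:
        raise ValueError("baseline count is zero; cannot form a relative rate")
    return 100.0 * series / baseline

# Made-up counts with gaps, using a short window for readability.
raw = [10, np.nan, 30, np.nan, np.nan, 60]
filled = interpolate_gaps(raw)        # 10, 20, 30, 40, 50, 60
rates = normalize(filled, window=6)   # 100, 200, ..., 600 (% of baseline)
```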
Topics
• On-going Event (more than 24 hours): excluded
  – Wimbledon; 9 Iowa State
• Spurious (less than 30 min): excluded
  – Time for Pretty Little Liars; The Weekend - Earned It
• Topic of the Day (within 24 hours and more than 30 min): included
  – State of the Union; UnityMarch
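The topic filter above can be sketched as a simple duration rule. This is an illustrative sketch only: it assumes the thresholds refer to how long a topic stays on the trending list, and it treats a duration of exactly 30 minutes or exactly 24 hours as included, which the slide does not specify. The names `categorize` and `keep_topic` and the sample durations are hypothetical.

```python
from datetime import timedelta

MIN_DURATION = timedelta(minutes=30)  # below this: spurious
MAX_DURATION = timedelta(hours=24)    # above this: on-going event

def categorize(duration):
    """Map a topic's trending duration to one of the three slide categories."""
    if duration < MIN_DURATION:
        return "spurious"
    if duration > MAX_DURATION:
        return "on-going event"
    return "topic of the day"

def keep_topic(duration):
    """Only topics of the day are included in the training set."""
    return categorize(duration) == "topic of the day"

# Made-up durations for illustration; not measured values from the talk.
samples = {
    "hypothetical short-lived topic": timedelta(minutes=10),
    "hypothetical multi-day topic": timedelta(hours=30),
    "hypothetical topic of the day": timedelta(hours=3),
}
included = [name for name, d in samples.items() if keep_topic(d)]
```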
Approach: Feature Engineering
K-Means Clustering with Dynamic Time Warping
• Similar to speech: identify the same word said by different people
• Distance metric is Euclidean distance
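One way to picture the modeling step is the sketch below, which assigns each series to its nearest centroid under a dynamic-time-warping distance and then assigns a new series to a learned cluster. It is a simplified sketch, not the talk's implementation: centroids are updated with a plain element-wise mean rather than a DTW-aware average, and the names `dtw_distance`, `kmeans_dtw`, `classify`, and `init_idx`, as well as the sample data, are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative cost of the best monotone alignment of a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed per-point cost
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def kmeans_dtw(series, k, init_idx=None, n_iter=10, seed=0):
    """K-means-style clustering of equal-length series using DTW distance.

    Simplification: centroids are the element-wise mean of their members.
    """
    series = np.asarray(series, dtype=float)
    if init_idx is None:
        rng = np.random.default_rng(seed)
        init_idx = rng.choice(len(series), size=k, replace=False)
    centroids = series[list(init_idx)].copy()
    labels = np.zeros(len(series), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under DTW.
        dists = np.array([[dtw_distance(s, c) for c in centroids]
                          for s in series])
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster's members (skip empty clusters).
        for c in range(k):
            members = series[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids

def classify(new_series, centroids):
    """Assign a new series to the closest learned cluster."""
    return int(np.argmin([dtw_distance(new_series, c) for c in centroids]))

# Made-up data: two step-wise and two gradual series.
data = [[0, 0, 0, 10, 10, 10], [0, 2, 4, 6, 8, 10],
        [0, 10, 10, 10, 10, 10], [0, 1, 3, 5, 7, 10]]
labels, centroids = kmeans_dtw(data, k=2, init_idx=[0, 1])
```

With this toy data the two step-wise series end up in one cluster and the two gradual series in the other, even though the steps occur at different times.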
[Figure: Alignment using Dynamic Time Warping, before…]
Approach: Modeling
[Figure: …after alignment]
[Figure: trend shapes by cluster. Panels: Step-wise, Burst, Gradual, Steady, blimps. X-axis: Time (min) – Before Trending. Y-axis: Tweet Rate %]
Initial Results: ‘Library’ Of Trends
• Labeling - Identification of Trending Topics
• Forecasting - Ranking of Topics by Volume
• Other social media streams (Tumblr, Instagram, etc.)
Next Steps: Potential Implications