Using Twitter Early Detection of Trending Topics D.C NLP Meetup June 10, 2015

Page 1: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

Using Twitter

Early Detection of Trending Topics

D.C NLP Meetup

June 10, 2015

Page 2: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

Topics

• Motivation
• Underlying Theory
• Challenge
• Approach
• Initial Results
• Potential Implications

Page 3: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

Timeline
• 9:31 AM – Explosion occurred
• +1 min – First tweet
• +20 min – Local news reported

Reference: https://gigaom.com/2014/03/16/how-twitter-confirmed-the-explosion-in-harlem-before-the-news-did/

Harlem Gas Explosion in NYC (March 2014)

Motivation: Find the ‘tweet-dle’ within the ‘tweet-stack’

Page 4: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

[Diagram: a tweet reaches me. If my reaction is ‘Interesting!’, the action is a retweet/tweet, which produces a new tweet. If my reaction is ‘Meh’, the action is no retweet/tweet.]

Theory: It’s Not Why We Share But How We Share

Page 5: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

[Figure: Tweet Rate Over Time Across Topics. Panels: Step-wise, Gradual, Quick Rise. X-axis: Time Before Tagged As Trending Topic (min)]

Implications
• Multiple ways topics can ‘trend’
• Approaches
  – Parametric
    • Too many variations
  – Non-parametric
    • Supports wide variations in a more automated fashion

Challenge: All Roads Lead to Rome!

Page 6: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

[Figure: Tweet Rate Over Time Across Topics. Panels: Step-wise, Gradual, Quick Rise. X-axis: Time Before Tagged As Trending Topic (min)]

Clustering (used to classify new trends)

Approach: Time-Series Clustering
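As a rough illustration of how clusters could be "used to classify new trends", the sketch below assigns a new tweet-rate series to the closest cluster centroid. The centroid shapes, the labels, and the function names `euclidean` and `nearest_cluster` are hypothetical and not taken from the talk; the distance function is pluggable so that a time-warping distance could be passed in instead.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(series, centroids, distance=euclidean):
    """Return the label of the centroid closest to `series`.

    `centroids` maps a cluster label to a representative series.
    """
    return min(centroids, key=lambda label: distance(series, centroids[label]))

# Hypothetical centroids, invented for illustration only.
centroids = {
    "gradual": [1, 2, 3, 4],
    "quick rise": [1, 1, 1, 4],
}
label = nearest_cluster([1, 1, 2, 4], centroids)
```

Here the new series sits closer to the "quick rise" shape than to the "gradual" one, so it would be filed under that cluster.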

Page 7: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

Pipeline stages: Data Collection → Feature Engineering → Modeling

Data Collection
• Tweets (Streaming API)
• Topics (Trend API)
• Notes:
  – Streaming API: 1% of tweets
  – English only
  – 2-week sample (Jan ’15)

Feature Engineering
• Tweets: split trending vs. non-trending
• Topics: filter for topic of the day
• Tweet normalization / interpolation
• Topic identification
  – Trending (#, unigrams)
  – Non-trending (#)
• Trending topics
  – Exclude recurring or spurious
  – Include topics within 24 hrs

Modeling
• K-Means clustering
• Distance metric
  – Use dynamic time warping to align time series

Approach: Data Pipeline

Page 8: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

1. Normalization
• Time series plot based on tweet rate
• Fixed length (120 min)
• Tweet rate based on tweet 120 min ago

2. Linear Interpolation
• Due to streaming API, 1% of tweets
• Gaps in the data
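The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's code: per-minute counts are assumed to be a list with `None` marking minutes where the 1% sample returned nothing, and "tweet rate based on tweet 120 min ago" is read here as dividing each count by the count at the start of the 120-minute window. The function names are hypothetical.

```python
def interpolate_gaps(counts):
    """Fill None entries by linear interpolation between observed neighbours.

    Gaps at the start or end are filled with the nearest observed value.
    """
    known = [i for i, c in enumerate(counts) if c is not None]
    if not known:
        raise ValueError("no observed counts to interpolate from")
    filled = [float(c) if c is not None else None for c in counts]
    for i in range(known[0]):                      # leading gap
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):    # trailing gap
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):      # interior gaps
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def normalize_window(counts, window=120):
    """Return the last `window` minutes as a rate relative to the window's first value.

    Assumption: the baseline is the (interpolated) count at the window start.
    """
    filled = interpolate_gaps(counts)
    if len(filled) < window:
        raise ValueError("series is shorter than the requested window")
    recent = filled[-window:]
    baseline = recent[0]
    if baseline == 0:
        raise ValueError("baseline count is zero; rate is undefined")
    return [c / baseline for c in recent]

# Tiny example with a 3-minute window instead of 120.
rates = normalize_window([10, None, 30, 40], window=3)
```

With a fixed window length, every topic yields a series of the same size, which keeps later comparisons between topics simple.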

Topics, by time spent trending:
• On-going Event (more than 24 hours), e.g. Wimbledon; 9 Iowa State: Excluded
• Spurious (less than 30 mins), e.g. Time for Pretty Little Liars; The Weekend - Earned It: Excluded
• Topic of the Day (within 24 hours and more than 30 mins), e.g. State of the Union; UnityMarch: Included
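The duration rule for topics can be written as a small filter. This is a sketch only: the function names and the example durations are made up, and the handling of a topic that trends for exactly 30 minutes is an assumption, since the slide does not say which side of the threshold it falls on.

```python
SPURIOUS_MAX_MIN = 30        # "less than 30 mins"
ONGOING_MIN_MIN = 24 * 60    # "more than 24 hours"

def classify_topic(trending_minutes):
    """Label a topic by how long it stayed on the trending list."""
    if trending_minutes < SPURIOUS_MAX_MIN:
        return "spurious"
    if trending_minutes > ONGOING_MIN_MIN:
        return "ongoing_event"
    # Assumption: exactly 30 minutes counts as a topic of the day.
    return "topic_of_the_day"

def keep_topics_of_the_day(durations):
    """Keep only topics whose trending time falls inside the accepted range."""
    return [topic for topic, minutes in durations.items()
            if classify_topic(minutes) == "topic_of_the_day"]

# Hypothetical durations in minutes, invented for illustration.
example = {"topic_a": 10, "topic_b": 300, "topic_c": 3000}
kept = keep_topics_of_the_day(example)
```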

Approach: Feature Engineering

Page 9: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

K-Means Clustering with Dynamic Time Warping
• Similar to speech – identify the same word said by different people
• Distance metric is Euclidean distance
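For readers unfamiliar with dynamic time warping, here is a minimal textbook version of the distance for one-dimensional series, using the absolute difference (the Euclidean distance in one dimension) as the point-to-point cost. It is a generic sketch, not the implementation used in the talk, and it applies no warping-window constraint.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i points of `a` with the first j points of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # Euclidean distance between two scalars
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both series advance
    return cost[n][m]

# The same shape, stretched in time, aligns at zero cost.
same_shape = dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3])
```

Because points may be matched more than once, two topics that rise in the same way but at different speeds end up close together, which is the property the speech analogy points at.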

[Figure: Alignment using Dynamic Time Warping, Before… and …After]

Approach: Modeling

Page 10: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

[Figure: Tweet Rate % against Time (min) – Before Trending, one panel per cluster. Panel labels as extracted: Step-wise, Burst, Gradual; Steady, blimps, blimps, blimps]

Initial Results: ‘Library’ Of Trends

Page 11: DC_NLP_June2015_Meetup_Twitter_Trending_Topic_Detection

• Labeling – Identification of Trending Topics
• Forecasting – Ranking of Topics by Volume
• Other social media streams (Tumblr, Instagram, etc.)

Next Steps: Potential Implications