Emerging topic detection on twitter based on temporal and social terms evaluation

Emerging Topic Detection on Twitter based on Temporal and Social Terms Evaluation

KDD 2010 Workshop on Multimedia Data Mining

Chin Hui Chen (陳晉暉 )

Agenda

• Introduction
• The Main Steps
  • Content Extraction
  • User Authority
  • Content Aging Theory
  • Selection of Emerging Terms
  • From Emerging Terms to Emerging Topics

• Experiments and Evaluation

Introduction

• Twitter.com
  • 75 million users as of December 2009.
  • 6.2 million new accounts per month (2-3 per second).

• People post tweets for …
  • Daily chatter
  • Conversations
  • Sharing information/URLs
  • Reporting news

Introduction (cont’d)

• One of the founders of Twitter.com …

• Twitter as a portal for low-level information and news flashes.

Introduction (cont’d)

• Target : Extract the emerging topics.
• Process :
  • Content Extraction
  • User Authority
  • Content Aging Theory
  • Selection of Emerging Terms
  • From Emerging Terms to Emerging Topics


Step 1: Content Extraction

• Target : Tweets => Vector
• t-th considered interval :

• Each tweet => tweet vector

Content Extraction (cont’d)

• The tweet vector holds one weight per vocabulary term, so its length equals the vocabulary size.

• Each weight is computed from the term frequency of the x-th vocabulary term in the j-th tweet and the highest term frequency value in the j-th tweet.
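The weighting formula itself did not survive on the slide. As a minimal sketch, the snippet below assumes the common augmented normalized term frequency (0.5 + 0.5 · tf / max tf for terms that occur, 0 otherwise). The function name `tweet_vector` and the whitespace-tokenized input are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def tweet_vector(tokens, vocabulary):
    """Return one weight per vocabulary term for a single tokenized tweet.

    Assumption: augmented normalized term frequency; terms that do not
    occur in the tweet get weight 0.
    """
    counts = Counter(tokens)
    max_tf = max(counts.values(), default=0)  # highest term frequency in this tweet
    if max_tf == 0:
        return [0.0] * len(vocabulary)
    return [
        0.5 + 0.5 * counts[term] / max_tf if counts[term] else 0.0
        for term in vocabulary
    ]


vocabulary = ["obama", "victory", "football"]
vector = tweet_vector(["victory", "victory", "obama"], vocabulary)
```

The 0.5 offset keeps every term that occurs at least at half weight, which matters for tweets: they are short, so most terms occur only once.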

Step 2: User Authority

• Target : Which User is Important ?

• Define an author-based graph G(U,F), where U is the set of users and F is the set of directed edges; each edge represents a follower relationship.

User Authority (cont’d)

• Compute authority => PageRank
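The authority computation can be illustrated with a plain power-iteration PageRank over the follower graph. This is a sketch under assumptions: edges are `(follower, followed)` pairs, the damping factor is the conventional 0.85, and users who follow nobody spread their rank uniformly. None of these settings are stated on the slides.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (follower, followed) edges.

    Authority flows from a follower to the users they follow.
    """
    users = sorted({user for edge in edges for user in edge})
    out_links = {user: [] for user in users}
    for follower, followed in edges:
        out_links[follower].append(followed)

    n = len(users)
    rank = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        # Rank held by users with no outgoing edge is redistributed uniformly.
        dangling = sum(rank[u] for u in users if not out_links[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in users}
        for u in users:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank


# Hypothetical toy graph: "a" and "b" both follow "c".
authority = pagerank([("a", "c"), ("b", "c")])
```

In this toy graph "c" ends up with the highest authority, because it is the only user with followers.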

Step 3: Content Aging Theory

• Target : Find Emerging Term.

• An Emerging keyword can be viewed as a semantic unit which links to a very recent news event.

• Chien Chin Chen, Yao-Tsung Chen, Yeali S. Sun, Meng Chang Chen: Life Cycle Modeling of News Events Using Aging Theory. ECML 2003

• See each term as a living organism:
  • With nourishment => life cycle is prolonged => high energy
  • Without nourishment => dies => low energy

Content Aging Theory (cont’d)

• Term with high energy => currently important
• Term with low energy => out of favor

• So, we need to know how to compute Nutrition and Energy.
  • Content Nutrition
  • Content Energy

Content Aging Theory (cont’d) – Content Nutrition

• Each food brings a different calorie contribution depending on its ingredients.
• Different tweets containing the same keyword generate different amounts of nutrition.
• Define the amount of nutrition :
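The nutrition formula is missing from the slide. One reading consistent with the earlier steps is that each tweet containing the keyword contributes the keyword's weight in that tweet, scaled by the authority of the tweet's author. The sketch below assumes exactly that; the data layout and the name `nutrition` are illustrative.

```python
def nutrition(keyword, tweets, authority):
    """Sum, over tweets containing `keyword`, of term weight x author authority.

    tweets: list of (author, {term: weight}) pairs for one time interval.
    authority: {author: authority score}; unknown authors count as 0.
    Assumption: nutrition is this weighted sum (not confirmed by the slides).
    """
    return sum(
        weights[keyword] * authority.get(author, 0.0)
        for author, weights in tweets
        if keyword in weights
    )


# Hypothetical interval with two tweets and precomputed authority scores.
tweets = [
    ("alice", {"victory": 1.0}),
    ("bob", {"victory": 0.75, "obama": 1.0}),
]
authority_scores = {"alice": 0.5, "bob": 0.25}
victory_nutrition = nutrition("victory", tweets, authority_scores)
```

Under this assumption a keyword used by a few high-authority users can receive more nutrition than one used by many low-authority users.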

Content Aging Theory (cont’d) – Content Energy

• The nutrition of a semantic unit is then mapped into energy, its effective contribution (how emergent it is).

• Hot Terms :

• Emergent Terms :

Content Aging Theory (cont’d) – Content Energy

• Define s = number of previous time slots.
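The energy formula is not legible on the slide. As a hedged sketch, the function below assumes one plausible form: compare the squared nutrition of the current interval with that of each of the s previous slots, and discount older slots by their distance in time. Both the formula and the name `energy` are assumptions for illustration.

```python
def energy(nutrition_history, s):
    """Energy of one keyword in the current interval.

    nutrition_history: nutrition values, oldest first; the last entry is
    the current interval. Only the s most recent previous slots are used.
    Assumed form: sum over previous slots x of
        (nutr_now**2 - nutr_x**2) / (distance in slots between now and x).
    """
    current = nutrition_history[-1]
    previous = nutrition_history[:-1][-s:] if s > 0 else []
    total = 0.0
    # distance = 1 for the most recent previous slot, 2 for the one before, ...
    for distance, past in enumerate(reversed(previous), start=1):
        total += (current ** 2 - past ** 2) / distance
    return total


rising = energy([1.0, 1.0, 3.0], s=2)   # keyword that suddenly gains nutrition
steady = energy([2.0, 2.0, 2.0], s=2)   # keyword with constant nutrition
```

With this form, a term that is constantly popular gets energy near zero, while a term whose nutrition jumps relative to its recent past gets high energy.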

Step 4: Selection of Emerging Terms

• Target : How to select emerging keywords.
• 1. Supervised
• 2. Unsupervised
  • Dynamically sets the critical drop
  • CoSeNa: a Context-based Search and Navigation System
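The two selection modes can be sketched as follows. The slide does not preserve the threshold definitions, so both are assumptions: the supervised variant takes a user-supplied factor `delta` times the mean energy as the critical drop, and the unsupervised variant ranks terms by energy and cuts at the largest gap between consecutive values.

```python
def select_supervised(energies, delta=1.5):
    """Keep terms whose energy exceeds delta x mean energy (assumed rule)."""
    if not energies:
        return set()
    critical_drop = delta * sum(energies.values()) / len(energies)
    return {term for term, value in energies.items() if value > critical_drop}


def select_unsupervised(energies):
    """Rank terms by energy and cut at the largest drop (assumed rule)."""
    ranked = sorted(energies.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return {term for term, _ in ranked}
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops))  # position of the largest gap in the ranking
    return {term for term, _ in ranked[: cut + 1]}


# Hypothetical energy values for four terms.
energies = {"victory": 12.0, "obama": 10.0, "morning": 1.0, "the": 0.5}
supervised = select_supervised(energies)
unsupervised = select_unsupervised(energies)
```

The supervised mode lets the user control how selective the system is, while the unsupervised mode adapts the cut to the shape of the energy ranking in each interval.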

Step 5: From Emerging Terms to Emerging Topics

• Target : Find Emerging Topics!

• Define a topic as a minimal set of terms semantically related to an emerging keyword.

• “victory”
  • Nov 2008 : “elections”, “Obama”, “USA”
  • Feb 2010 : “football”, “superbowl”, “New Orleans Saints”

• Method : co-occurrences

From Emerging Terms to Emerging Topics

• 1. Generate Correlation Vector
  • a. Use the keyword k as the query.
  • b. Use the set of tweets containing k as relevance feedback.
  • c. Rely on a probabilistic feedback mechanism.
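Steps a to c can be sketched with a classic probabilistic relevance-feedback weight. The exact formula used in the talk is not on the slide, so the sketch assumes a smoothed odds ratio (0.5 added to each count) multiplied by the gap between how often a term z appears with and without k. The function name and the set-of-terms tweet representation are illustrative.

```python
import math
from collections import Counter


def correlation_vector(keyword, tweets):
    """Weight every term co-occurring with `keyword` in the given tweets.

    tweets: list of sets of terms for one interval. Tweets containing the
    keyword act as the "relevant" set (relevance feedback).
    Assumed weight: log of a smoothed odds ratio times the difference
    between the term's rate inside and outside the relevant set.
    """
    total = len(tweets)
    relevant = [terms for terms in tweets if keyword in terms]
    n_relevant = len(relevant)
    if n_relevant == 0 or n_relevant == total:
        return {}

    doc_freq = Counter(term for terms in tweets for term in terms)
    co_freq = Counter(t for terms in relevant for t in terms if t != keyword)

    weights = {}
    for z, r in co_freq.items():
        n = doc_freq[z]
        odds_relevant = (r + 0.5) / (n_relevant - r + 0.5)
        odds_other = (n - r + 0.5) / (total - n - n_relevant + r + 0.5)
        gap = abs(r / n_relevant - (n - r) / (total - n_relevant))
        weights[z] = math.log(odds_relevant / odds_other) * gap
    return weights


# Hypothetical interval of six tweets.
sample = [
    {"victory", "obama"}, {"victory", "obama"}, {"victory", "football"},
    {"morning", "coffee"}, {"morning", "football"}, {"coffee"},
]
weights = correlation_vector("victory", sample)
```

In this toy interval "obama" appears only alongside "victory" and gets a high weight, while "football" appears equally often with and without it and gets a weight of zero.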

From Emerging Terms to Emerging Topics

• 2. Construct Topic Graph

Keyword-based topic graph :

Thinning.

From Emerging Terms to Emerging Topics

• 3. Topic Detection and Ranking

From Emerging Terms to Emerging Topics

• Find SCCs (Strongly Connected Components) :

• An emerging topic is a subgraph representing a set of keywords semantically related to term z within the time interval.

• Use DFS.
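The slide only says to use DFS. One standard DFS-based way to find strongly connected components is Kosaraju's two-pass algorithm, sketched below. Whether the authors used this variant or another (for example Tarjan's) is not stated, and the toy graph is hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm: two depth-first passes over a directed graph.

    graph: {node: iterable of successor nodes}.
    Returns a list of sets, one per strongly connected component.
    Recursive DFS, so suited to small keyword graphs only.
    """
    nodes = set(graph) | {v for successors in graph.values() for v in successors}

    # Pass 1: record nodes in order of DFS completion.
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in graph.get(u, ()):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    reverse = {u: [] for u in nodes}
    for u, successors in graph.items():
        for v in successors:
            reverse[v].append(u)

    components, assigned = [], set()

    def collect(u, component):
        assigned.add(u)
        component.add(u)
        for v in reverse[u]:
            if v not in assigned:
                collect(v, component)

    for u in reversed(order):
        if u not in assigned:
            component = set()
            collect(u, component)
            components.append(component)
    return components


# Hypothetical topic graph around the emerging term "victory".
topic_graph = {
    "victory": ["obama"],
    "obama": ["elections"],
    "elections": ["victory"],
    "usa": ["victory"],
}
components = strongly_connected_components(topic_graph)
```

Here "victory", "obama" and "elections" form one component because each is reachable from the others, while "usa" only points into the cycle and stays on its own.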

From Emerging Terms to Emerging Topics

• Ranking

Experiments and Evaluation

• Dataset :
  • 15 days (between 13th and 28th of April 2010)
  • More than 3 million tweets (10k/hr)
  • More than 300k different keywords

Real Case Study

• Set r = 15 mins, number of time slots s = 200 (about 2 solar days).
• Result :

History Worthiness

• Compare two different numbers of considered slots, s=100 and s=200.

History Worthiness (cont’d)

• “morning” => periodic events

History Worthiness (cont’d)

• The life status of a keyword depends on the number of time intervals considered.
• Temporal relevance of the retrieved topics (relevance is time-dependent).

Conclusion

• 1. Formalized the Keyword Life Cycle.
  • (now => frequent, past => rare)
• 2. Studied the Social Relationships.
• 3. Formalized the Keyword-based Topic Graph.
