WI: IWI
August 31st, 2010
Swit Phuvipadawat, Tsuyoshi Murata
Dept. of Computer Science
Tokyo Institute of Technology, Japan
Breaking News Detection and Tracking in Twitter
Outline
• Introduction
• Analysis
• Methodology
• Results and Application
• Challenges and Future Works
• Conclusion
Twitter as a news channel
http://blog.marsdencartoons.com/2009/06/18/cartoon-iranian-election-demonstrations-and-twitter/marsden-iran-twitter72/
In June 2009, during the Iranian election, Twitter transformed the way people convey news.
Twitter as a news channel
[Figure: example tweets grouped into news stories]
• Earthquakes around the world (Earthquake in Taiwan, Earthquake in Chile, Earthquake in Haiti)
❖ Earthquake with 6.4 magnitude hits Taiwan!
❖ Tsunami alert after Chilean earthquake.
• Iraq Election
❖ Early voting begin March, 7 Iraq Election
❖ US and UN hope Sunni participation help heal the wound.
• Apple announced iPad
❖ The Apple iPad starting $499
❖ Apple to launch iPad on March 26
❖ Steve Jobs demoed iPad
• Obama Health Reform
❖ Health care explained.
Research Topic
“Breaking News Detection and Tracking in Twitter”
➡ Topic Detection and Tracking (TDT)
➡ Information Retrieval
➡ Social Network Analysis
Topic Detection and Tracking (TDT)
• To monitor broadcast news and alert an analyst to new and interesting events happening in the world. [Allan 2001]
• To search, organize and structure multilingual, news-oriented textual materials from a variety of broadcast news media. [Fiscus & Doddington 2002]
• Focuses on 5 tasks:
❖ Story segmentation
❖ First story detection
❖ Cluster detection
❖ Tracking
❖ Story link detection
Recent Studies
• Topological characteristics of Twitter
What is Twitter, a Social Network or a News Media? H. Kwak, C. Lee, S. Moon [WWW2010]
➡ 85% of trending topics in Twitter appear in headline news
• Using Twitter data to improve web ranking
Time is of the Essence: Improving Recency Ranking Using Twitter Data A. Dong, R. Zhang et al. [WWW2010]
➡ Micro-blogging data reveals fresh URLs not yet indexed by search engines
• Event detection
Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors T. Sakaki, M. Okazaki, Y. Matsuo [WWW2010]
Recent Studies
• Influential topics and users detection
❖ Characterizing Microblogs with Topic Models D. Ramage, S. Dumais, D. Liebling [ICWSM2010]
➡ Uses Labeled LDA, a supervised learning model, to characterize the content of messages along substance, style, status and social dimensions.
❖ TwitterRank: Finding Topic-sensitive Influential Twitterers J. Weng, E.-P. Lim, J. Jiang, Q. He [WSDM2010]
➡ Uses PageRank with a topic model (LDA) to measure the influence of users.
Message Analysis
Msg Attributes              Count     %
Tag a user (@)              79,469    51.6%
Embed a link (http://)      50,404    32.7%
Retweet (RT)                29,935    19.4%
Use a hashtag (#)           20,348    13.2%
Findings from a dataset of 154,000 msg. with 33,000 msg. from news engaging users
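The slides do not say how the four attributes were detected. As a minimal sketch, assuming simple regular expressions are good enough, the counts and percentages in a table like the one above could be produced as follows. The pattern set and the function names `message_attributes` and `attribute_shares` are hypothetical, not part of the original system.

```python
import re
from collections import Counter

# Hypothetical patterns: the slides do not specify the detection rules.
PATTERNS = {
    "mention": re.compile(r"@\w+"),         # tag a user
    "link": re.compile(r"https?://\S+"),    # embed a link
    "retweet": re.compile(r"\bRT\b"),       # retweet marker
    "hashtag": re.compile(r"#\w+"),         # use a hashtag
}


def message_attributes(text):
    """Return the set of attribute names whose pattern occurs in the message."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def attribute_shares(messages):
    """Return {attribute: (count, percent of messages)} for a list of messages."""
    counts = Counter()
    for text in messages:
        counts.update(message_attributes(text))
    total = len(messages)
    return {
        name: (counts[name], 100.0 * counts[name] / total if total else 0.0)
        for name in PATTERNS
    }


if __name__ == "__main__":
    sample = [
        "RT @John Earthquake in Tokyo!",
        "Earthquake in Tokyo! http://example.com #quake",
    ]
    for name, (count, percent) in attribute_shares(sample).items():
        print(f"{name}: {count} ({percent:.1f}%)")
```

A message can carry several attributes at once, so the percentages need not sum to 100%. These patterns are deliberately loose: a URL fragment such as `#top` would also be counted as a hashtag.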
Text Characteristics        Examples
Sensational adjectives      E   terrible, horrible, terrifying, shocking, terrific, amazing, ...
Sensational phrases         E   wow! oh my god! ...
Significant nouns           F   US. President, Obama, Michael Jackson, Japan, Toyota, ...
Impactful verbs             F   kill, die, crash, reveal, discover, rescue, ...
Single Message Aspect
Data of March 2009
Network Analysis
Timeline Aspect
RT (retweet) means taking someone's Twitter message and rebroadcasting that same message.
Example:
John, 12:15: Earthquake in Tokyo!
Lisa, 12:30: RT @John Earthquake in Tokyo!
It is like retelling a story to your friends.
Method for Collecting, Indexing and Grouping
‣ Collecting
• Fetch messages using pre-defined search queries for breaking-news-related keywords and hashtags
‣ Indexing
• An index based on term vectors is constructed.
• Apache Lucene is used as the information retrieval library
‣ Grouping
• Similar messages are grouped together to form a news story
• Similarity comparison is based on the vector space model using TF-IDF with term boosting for proper nouns
Collecting
Grouping Method Explained
Conditions
• Messages in a group must be related to the first story
• Further messages can develop upon previous messages
A message is compared with the first message in a group and the top k terms in that group.
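$$\mathrm{sim}(m_1, m_2) = \sum_{t \in m_1} \left[ \mathrm{tf}(t, m_2) \times \mathrm{idf}(t) \times \mathrm{boost}(t) \right]$$
$$\mathrm{tf}(t, m) = \frac{\mathrm{count}(t \text{ in } m)}{\mathrm{size}(m)}$$
$$\mathrm{idf}(t) = 1 + \log\left( \frac{N}{\mathrm{count}(m \text{ has } t)} \right)$$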
Boost is raised for proper nouns (e.g. China, Obama, Toyota) and for hashtags. NER is used to detect proper nouns.
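The similarity and grouping rule on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the Hotstream implementation: the tokenizer, the natural logarithm, the set of boosted terms (passed in by hand rather than produced by NER), the `threshold` value, and all function names are hypothetical. The sketch compares each message only with the first message of each group and omits the comparison against the group's top k terms.

```python
import math
import re


def tokenize(text):
    """Lowercase and split into word tokens, keeping leading # and @."""
    return re.findall(r"[#@]?\w+", text.lower())


def tf(term, doc):
    """tf(t, m) = count(t in m) / size(m)."""
    return doc.count(term) / len(doc) if doc else 0.0


def idf(term, docs):
    """idf(t) = 1 + log(N / count(m has t)); 0 if no message has the term."""
    n_with = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / n_with) if n_with else 0.0


def sim(m1, m2, docs, boosted, b=1.5):
    """Sum over the distinct terms of m1 of tf(t, m2) * idf(t) * boost(t)."""
    return sum(
        tf(t, m2) * idf(t, docs) * (b if t in boosted else 1.0)
        for t in set(m1)
    )


def group_messages(messages, boosted, b=1.5, threshold=0.5):
    """Greedy grouping: join the best-matching group or start a new one."""
    docs = [tokenize(m) for m in messages]
    groups = []  # lists of message indices; element 0 is the first story
    for i, doc in enumerate(docs):
        best, best_score = None, threshold
        for group in groups:
            score = sim(docs[group[0]], doc, docs, boosted, b)
            if score >= best_score:
                best, best_score = group, score
        if best is None:
            groups.append([i])
        else:
            best.append(i)
    return groups


if __name__ == "__main__":
    msgs = [
        "Earthquake in Tokyo",
        "RT @John Earthquake in Tokyo",
        "Apple iPad price",
    ]
    print(group_messages(msgs, boosted={"tokyo"}))
```

Raising `b` increases the weight of shared proper nouns and hashtags, so messages that mention the same entity are more likely to land in the same group.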
Named Entity Recognizer
• Stanford Named Entity Recognizer (NER) has been adopted for the following uses:
➡ To detect proper nouns used in the grouping algorithm
➡ To classify messages based on named entities (Person, Organization, Location, Misc.)
• NER is based on linear chain Conditional Random Field (CRF)
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-37
Method for Group Ranking
The score for each group is computed as follows:
• A group score is based on reliability, popularity and freshness factors.
‣ Reliability comes from the number of followers of the user who posted a message.
‣ Popularity comes from the number of retweets.
‣ Freshness is computed from the difference between the current time and the time when a message was posted.
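The slide names the three factors but the scoring formula itself is not recoverable from this text. The sketch below shows one plausible way to combine them, purely as an assumption: the log damping of follower counts, the exponential decay with a `half_life` parameter, the way the factors are combined, and the `Message` and `group_score` names are all hypothetical and should not be read as the authors' formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Message:
    followers: int    # followers of the user who posted the message
    retweets: int     # number of retweets of the message
    posted_at: float  # posting time, in seconds since the epoch


def group_score(messages, now, half_life=3600.0):
    """Hypothetical score: (reliability + popularity) * freshness."""
    if not messages:
        return 0.0
    # Reliability: follower counts, log-damped so one huge account
    # does not dominate the group (an assumption of this sketch).
    reliability = sum(math.log1p(m.followers) for m in messages)
    # Popularity: total number of retweets in the group.
    popularity = sum(m.retweets for m in messages)
    # Freshness: halves every `half_life` seconds since the newest message.
    newest = max(m.posted_at for m in messages)
    freshness = 0.5 ** ((now - newest) / half_life)
    return (reliability + popularity) * freshness


if __name__ == "__main__":
    group = [Message(1200, 5, 0.0), Message(80, 0, 600.0)]
    print(group_score(group, now=600.0))
    print(group_score(group, now=600.0 + 3600.0))  # one half-life later
```

With this choice a group loses half of its score for every hour without a new message, which pushes stale stories down the ranking.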
Detection Effectiveness
Method                                  Search query
Precision                               90.0% (45/50)
Recall                                  -
Spam                                    8% (4/50)
Avg. time to collect 100 new msg.       72 sec
User generated                          11.1% (5/45)
Based on an experiment conducted in June 2010
Example Result of Grouping
(a) No boost:  G0 = {M3, M4, M5}, G1 = {M7, M8}, G2 = {M0, M1}, G3 = {M2}, G4 = {M6}, G5 = {M9}
(b) b=1.5:     G0 = {M2, M3, M4, M5}, G1 = {M7, M8}, G2 = {M0, M1}, G4 = {M6}, G5 = {M9}
(c) b=1.7:     G0 = {M0, M1, M7, M8}, G1 = {M2, M3, M4, M5}, G2 = {M6}, G3 = {M9}
(d) b=2:       G0 = {M2, M3, M4, M5, M9}, G1 = {M0, M1, M7, M8}, G2 = {M6}
Topic labels in the figure: Toyota, MJ., Airline, US. Japan, Prisoner
Boosting improves the grouping result
Application
• A prototype application called Hotstream is developed.
• The goal is to create an automatic news portal based on Twitter data.
Challenges
• The length of messages is short
• Two similar stories may be expressed using different vocabulary terms
• The style of writing is unconventional, with slang and many spelling variants
Future Works
• Explore the community structures of named entities to find relationships among groups of messages
Grouped by TF-IDF with proper noun term boosting
Example Dataset
Top 18 stories and their keywords from Hotstream as of July 21st, 2010
Red nodes = keywords, Yellow nodes = message groups
Messages-Named Entities
Community Detection Experiment
Network Characteristics
Network Type                    Edge betweenness
No. Vertices                    453 (254,200)
No. Edges                       1280
Mean Degree                     5.639
No. Clusters                    40
Largest Component Fraction      0.781
Community Detection Results
Method                          Edge betweenness
No. Communities                 68
Modularity                      0.71
Purity                          0.67
Communities labeled in the figure: BP Oil leak; Australian Prime Minister; US. Military in Middle East
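The slides report modularity and purity without defining them. As a sketch, assuming the standard definitions for an undirected, unweighted graph with at least one edge, both measures can be computed from an edge list and a community assignment as below. The function names and the toy graph are illustrative only, and the edge betweenness community detection step itself is not reproduced here.

```python
from collections import Counter


def modularity(edges, community_of):
    """Q = sum over communities c of e_c/m - (d_c / 2m)^2, where e_c is the
    number of edges inside c, d_c the total degree of c, m the edge count."""
    m = len(edges)
    internal = Counter()
    degree = Counter()
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def purity(community_of, label_of):
    """Fraction of nodes whose true label is the majority label of their
    detected community."""
    labels_by_community = {}
    for node, c in community_of.items():
        labels_by_community.setdefault(c, Counter())[label_of[node]] += 1
    majority = sum(max(cnt.values()) for cnt in labels_by_community.values())
    return majority / len(community_of)


if __name__ == "__main__":
    # Toy graph: two triangles joined by a single bridge edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    labels = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y", 5: "y"}
    print(modularity(edges, communities))
    print(purity(communities, labels))
```

In the toy graph the bridge edge is the one with the highest edge betweenness, so removing it yields the two triangles used as communities here.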
Conclusion
• Introduced Twitter as a means of conveying news
• Described the message and network characteristics of Twitter
• Described the method to collect, index, group and rank messages
• Introduced Hotstream, an automatic news portal
• Proposed an extension study on the group-keyword network to improve the grouping result