Tutorial on Topic modelling
from messages on the
Internet
by Xiang Kong
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
Plan
• Motivation
• Methods
• Conclusion
• Relevant work
Why we are interested
• Millions of messages are generated on the Internet (Twitter, Facebook, LinkedIn, etc.)
• Understanding the interests of a large number of people
• Recognizing topics in the real world
More about Why
According to Cowen & Co Predictions & Report:
• Twitter had 241 million monthly active users at the end of 2013
• Twitter was predicted to reach 270 million monthly active users by the end of 2014
Can traditional methods work?
• Traditional methods (PLSA, LDA) do not work very well on Internet messages.
Their results may not represent what is actually happening on the Internet
(from Hong L, 2010)
Difference
• Usually, they are very short (tweets)
• Different languages in one short message
• Unstructured
• Abbreviations
We need some changes
• Combining with extra sources (hashtags in Twitter)
• Clustering-based topic extraction
• Vision information
Hashtag
• Providing context or metadata for online
messages (tweets for example)
• Organizing the information in the online
messages for retrieval
• Helping to find the latest trends
• Helping to reach a larger audience
Fig 1 Hashtag example
Performance Improvement (Xinfan Meng, Furu Wei, Xiaohua Liu et al.)
Topic extraction from tweets using graph-based methods with the help of hashtags (Labeled LDA).
• The co-occurrence and cosine methods produce more, and smaller, clusters than the Labeled LDA method.
• The distributional similarity approach (Labeled LDA) based on hashtags can greatly improve the performance
Fig 2. Accuracy of topic extraction for different methods
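The cosine method mentioned above can be illustrated with a minimal sketch. It assumes each hashtag is represented by a bag of the words that co-occur with it in tweets; the function names and sample tweets are hypothetical and are not taken from the cited work.

```python
import math
from collections import Counter, defaultdict


def hashtag_vectors(tweets):
    """Map each hashtag to a bag-of-words Counter built from the tweets that carry it."""
    vectors = defaultdict(Counter)
    for text in tweets:
        tokens = text.lower().split()
        tags = [t for t in tokens if t.startswith("#") and len(t) > 1]
        words = [t for t in tokens if not t.startswith("#")]
        for tag in tags:
            vectors[tag].update(words)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample tweets, for illustration only.
tweets = [
    "earthquake hits haiti #haiti #earthquake",
    "donate to haiti relief #haiti",
    "new phone released today #tech",
]
vecs = hashtag_vectors(tweets)
related = cosine(vecs["#haiti"], vecs["#earthquake"])
unrelated = cosine(vecs["#haiti"], vecs["#tech"])
```

Hashtags whose similarity exceeds a chosen threshold could then be linked in a graph and grouped into topic clusters; the threshold would need tuning on real data.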
Missing hashtags
• User's tweet history
• Social graph
• Influential friends
• Temporal Information
Different clustering (Ming Xie and Yunlu Zhang)
Optimizing the density-based clustering algorithm OPTICS
• It uses WordNet for word sense disambiguation of words in the learning-resource documents
• It maps the data space of the original method to a vector space of sentences, improving the original OPTICS algorithm.
Vision-based (1) (Qingshui Li and KaiWu 2010)
• The vision information of a Web page (navigation bars, banner bars, etc.) could avoid the need for sophisticated natural language processing technology
• Analyzing the visual characteristics of each page block to accurately determine the topic data region.
Vision-based (2) (Qingshui Li and KaiWu 2010)
• First detect whether the Web page contains a specific tag; if it does, analyze the page according to that tag.
• From the information in these vision tags, topics can be extracted more efficiently.
• Topics are usually displayed in a large font, in a prominent position, or in a different color.
• Some topics, however, are not displayed in a special format, and some pages have no explicit topic at all. In this case, they use the frequency algorithm to carry out the topic extraction.
NB: VB in the sample means vision block.
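The two-step idea above (look for visually emphasized tags first, fall back to word frequency otherwise) can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: the set of emphasis tags, the length filter used in place of a stop-word list, and the function names are all assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags stand in for "large font / prominent position" cues.
EMPHASIS_TAGS = {"title", "h1", "h2", "h3", "b", "strong"}


class EmphasisCollector(HTMLParser):
    """Collect all page text, and separately the text inside emphasis tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many emphasis tags are currently open
        self.emphasized = []
        self.all_text = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        if self.depth > 0:
            self.emphasized.append(text)


def extract_topic(html, top_k=3):
    """Return emphasized text if present, else the top_k most frequent words."""
    parser = EmphasisCollector()
    parser.feed(html)
    parser.close()
    if parser.emphasized:
        return parser.emphasized[0]
    # Frequency fallback; words of 3 letters or fewer are dropped as a crude
    # stand-in for stop-word removal (an assumption of this sketch).
    words = re.findall(r"[a-z]+", " ".join(parser.all_text).lower())
    counts = Counter(w for w in words if len(w) > 3)
    return " ".join(w for w, _ in counts.most_common(top_k))


with_heading = extract_topic("<html><body><h1>Haiti earthquake</h1><p>Rescue teams arrive.</p></body></html>")
without_heading = extract_topic("<p>storm warning storm coast storm warning</p>", top_k=2)
```

A real system would also consider font size, color, and block position taken from the rendered page, which plain tag inspection cannot capture.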
Topic emerging time (Adrien Bougouin and Florian Boudin)
(Mario Cataldi, Luigi Di Caro and Claudio Schifanella)
We want to know how the topics on the Internet change
• Extracting the contents according to a novel aging theory
• Analyzing the social relationships in the network with the well-known PageRank algorithm in order to determine the authority of the users.
• Finally, we leverage a navigable topic graph, allowing the detection of emerging topics under user-specified time constraints.
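The user-authority step can be illustrated with a minimal power-iteration PageRank over a follower graph. The graph, the user names, and the parameter values below are hypothetical; an edge u -> v is assumed to mean "u follows v", so authority flows toward followed users.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    follows maps each user to the list of users they follow (edge u -> v).
    """
    users = set(follows)
    for targets in follows.values():
        users.update(targets)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Every user receives the teleportation share first.
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling user (follows nobody): spread rank uniformly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical follower graph, for illustration only.
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}
ranks = pagerank(follows)
most_authoritative = max(ranks, key=ranks.get)
```

In this toy graph "carol" is followed by everyone else and ends up with the highest score; such scores could then be used to weight each user's tweets.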
Novel aging theory (1)
• Many conventional clustering and classification strategies cannot be applied to this problem because they tend to ignore the temporal relationships among documents (tweets in our case) related to a news event.
• We can evaluate the usage of a keyword by its energy, which indicates the vitality status of the keyword and can qualify the keyword’s usage. In fact, a high energy value implies that the term is becoming important in the considered community, while a low energy value implies that it is currently falling out of favor.
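The notion of keyword energy can be made concrete with a small sketch. It is a simplified illustration in the spirit of the aging theory, not necessarily the exact formula of the cited paper: raw per-interval usage counts stand in for the keyword's weighted usage, and the window size is an assumed parameter.

```python
def energy(usage, window=3):
    """Energy of a keyword in the most recent time interval.

    usage: per-interval usage scores for one keyword, oldest first.
    The current squared usage is compared with each of the previous
    `window` intervals, and older intervals are weighted less (1 / age).
    """
    if not usage:
        raise ValueError("usage must contain at least one interval")
    current = usage[-1]
    past = usage[-(window + 1):-1]
    total = 0.0
    for age, previous in enumerate(reversed(past), start=1):
        total += (current ** 2 - previous ** 2) / age
    return total


# Hypothetical usage counts, for illustration only.
rising = energy([2, 3, 2, 40])    # sudden burst: strongly positive energy
fading = energy([40, 30, 20, 5])  # declining usage: negative energy
steady = energy([5, 5, 5, 5])     # constant usage: zero energy
```

Keywords whose energy exceeds a chosen threshold would be treated as candidates for emerging topics.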
Novel aging theory (2)
Fig 4 Statistical usage of the term “earthquake” in Twitter from October 2009 to January 2010; the peak corresponds to the earthquake that occurred in Haiti on 12 January 2010.
Fig 5 A topic graph with two Strongly Connected Components (in red and blue) representing two different emerging topics: labels in bold represent emerging keywords, while the thickness of an edge represents the semantic relationship between the considered keywords.
After extracting topics
• Recommendation system (Brendan O'Connor, Michel Krieger and David Ahn)
• Topic evolution (Yookyung Jo, John E. Hopcroft, Carl Lagoze)
An example (1) (Mathieu Bastian, Matthew Hayes, William Vaughan et al.)
“Skills and Expertise” is a data-driven feature on LinkedIn, the world’s largest professional online social network, which allows members to tag themselves with topics representing their areas of expertise.
An example (2)
• Folksonomy creation
Entity extraction
Clustering to provide context
• Skills Inference and Recommendation
Naive Bayes classifier to estimate the
likelihood that a user has a skill
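The skill-inference step can be sketched with a small Naive Bayes classifier over profile tokens. This is a minimal illustration with add-one smoothing; the training profiles, the target skill, and the function names are hypothetical and do not describe LinkedIn's actual features or system.

```python
import math
from collections import Counter


def train(profiles):
    """profiles: list of (tokens, has_skill) pairs for one target skill."""
    class_counts = Counter()
    token_counts = {True: Counter(), False: Counter()}
    vocab = set()
    for tokens, has_skill in profiles:
        class_counts[has_skill] += 1
        token_counts[has_skill].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def skill_probability(model, tokens):
    """Posterior probability that a profile with these tokens has the skill."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (True, False):
        # Class prior with add-one smoothing.
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp[True] / (exp[True] + exp[False])


# Hypothetical training data for a hypothetical "big data" skill.
profiles = [
    (["hadoop", "java", "mapreduce"], True),
    (["hadoop", "pig", "hive"], True),
    (["marketing", "sales"], False),
    (["sales", "excel"], False),
]
model = train(profiles)
p_yes = skill_probability(model, ["hadoop", "hive"])
p_no = skill_probability(model, ["sales", "marketing"])
```

Profiles with a high posterior could then be shown the skill as a recommendation.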
Practical design
• Large data scale
Hadoop (MapReduce) framework
Dimensionality reduction (PCA, ICA)
• Topic extraction
Unsupervised training (EM, DNN, etc.)
Conclusion
• It is necessary to extract summaries of messages on the Internet
• Online messages pose some difficulties but also give us some extra clues (vision)
• Some applications (LinkedIn skills inference system)
Further work (1)
• It is a popular field right now due to the popularity of the Internet
• How to handle online messages is also an open field
Highly unstructured data
Short but meaningful messages
• How to make implementations that cluster online messages faster
Further work (2)
• Mining the relationships between topics, topic evolution thread discovery and textual mining on evolution threads.
• Building a navigational application from the graph's concrete information, i.e. through “edges” between topics in the graph model.
References
• Papers used in this tutorial are in this file
https://subversion.ews.illinois.edu/svn/sp16-cs410/xkong12/progress.pdf