Tutorial on Topic Modelling from messages on the Internet

by Xiang Kong

Prepared as an assignment for CS410: Text Information Systems in Spring 2016

Plan

• Motivation

• Methods

• Conclusion

• Relevant work

Why we are interested

• Millions of messages are generated on the Internet (Twitter, Facebook, LinkedIn, etc.)

• Understanding the interests of a large number of people

• Recognizing topics in the real world

More about Why

According to Cowen & Co Predictions & Report:

• Twitter had 241 million monthly active users at the end of 2013

• Twitter was predicted to reach 270 million monthly active users by the end of 2014

Can traditional methods work?

• Traditional methods (PLSA, LDA) do not work very well on Internet messages.

Results may not represent what actually happens on the Internet (Hong L, 2010)

Difference

• Usually, they are very short (tweets)

• Different languages in one short message

• Unstructured

• Abbreviations

We need some changes

• Combining with extra sources (hashtags in Twitter)

• Clustering-based topic extraction

• Vision information

Hashtag

• Providing context or metadata for online messages (tweets, for example)

• Organizing the information in online messages for retrieval

• Helping to find the latest trends

• Helping to reach a larger audience

Fig 1 Hashtag example
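
As a rough illustration of how hashtags can act as metadata, the sketch below pulls hashtags out of short messages and pools the messages that share a tag into one pseudo-document per hashtag. The function name `pool_by_hashtag`, the regular expression, and the sample messages are assumptions made for this example; they do not come from the cited papers.

```python
import re
from collections import defaultdict

# Assumed, simplified hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(messages):
    """Group message bodies into one pseudo-document per hashtag."""
    pools = defaultdict(list)
    for text in messages:
        # Lower-case tags so that #Haiti and #haiti land in the same pool.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        # Keep the message text without the hashtags themselves.
        body = HASHTAG_RE.sub("", text).strip()
        for tag in tags:
            pools[tag].append(body)
    return dict(pools)


# Hypothetical sample messages.
messages = [
    "Magnitude 7 quake hits Haiti #earthquake #haiti",
    "Donate to relief efforts #Haiti",
    "New phone released today #tech",
]
pools = pool_by_hashtag(messages)
```

Pooling of this kind gives a topic model longer documents to work with than individual tweets, which is one way to combine short messages with an extra source.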

Performance improvement (Xinfan Meng, Furu Wei, Xiaohua Liu et al.)

Topic extraction from tweets using graph-based methods with the help of hashtags (Labeled LDA).

• The Co-occurrence and Cosine methods produce more and smaller clusters than the Labeled LDA method.

• The distributional similarity approach (Labeled LDA) based on hashtags can greatly improve performance

Fig 2. Accuracy of topic extraction for different methods
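
To make the co-occurrence and cosine idea concrete, here is a minimal sketch that represents each hashtag by the words it co-occurs with and compares two hashtags by cosine similarity. It is an assumption-laden toy, not the method evaluated in the cited work: tokenization is plain whitespace splitting, and the helper names `hashtag_vectors` and `cosine` are invented for this example.

```python
import math
from collections import Counter, defaultdict


def hashtag_vectors(messages):
    """Map each hashtag to a bag of the words that co-occur with it."""
    vectors = defaultdict(Counter)
    for text in messages:
        tokens = text.lower().split()
        tags = [t[1:] for t in tokens if t.startswith("#") and len(t) > 1]
        words = [t for t in tokens if not t.startswith("#")]
        for tag in tags:
            vectors[tag].update(words)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical sample messages.
messages = [
    "quake relief needed #haiti",
    "quake relief update #earthquake",
    "new phone launch #tech",
]
vectors = hashtag_vectors(messages)
similar = cosine(vectors["haiti"], vectors["earthquake"])
unrelated = cosine(vectors["haiti"], vectors["tech"])
```

Hashtags whose similarity exceeds some threshold could then be linked in a graph and grouped into clusters; the threshold would have to be tuned on real data.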

Missing hashtags

• User's tweet history

• Social graph

• Influential friends

• Temporal Information

Different clustering (Ming Xie and Yunlu Zhang)

Optimizing the density-based clustering algorithm OPTICS

• It uses WordNet for word sense disambiguation of words in the learning-resource documents

• It maps the data space of the original method to a vector space of sentences, improving on the original OPTICS algorithm.

Vision-based (1) (Qingshui Li and Kai Wu 2010)

• Vision information on a Web page (navigation bars, banner bars, etc.) makes it possible to avoid sophisticated natural language processing techniques

• Analyzing the vision characteristics of each page block makes it possible to accurately determine the topic data region.

Vision-based (2) (Qingshui Li and Kai Wu 2010)

• First detect whether the Web page contains a specific tag; if it does, analyze the page according to that tag.

• From the information in these vision tags, topics can be extracted more efficiently.

• Topics are usually displayed in a large font, a prominent position or a different color.

• Otherwise, some topics are not displayed in a special format, or the page has no explicit topic at all. In this case, a frequency algorithm is used to carry out the topic extraction.

NB: VB in the sample means vision block.
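
The two-step procedure above (look for a specific tag first, fall back to word frequency otherwise) could be sketched as follows. This is only an illustration using Python's standard `html.parser`; treating `title` and `h1` as the "specific tags" and ignoring words of three letters or fewer are assumptions of this sketch, not details taken from the cited paper.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed stand-ins for the "specific tags" that mark a visually prominent topic.
PROMINENT_TAGS = ("title", "h1")


class TopicParser(HTMLParser):
    """Collects text in prominent tags and, separately, all other words."""

    def __init__(self):
        super().__init__()
        self.current = None   # tag that was opened most recently
        self.prominent = []   # text found inside prominent tags
        self.words = []       # remaining words, for the frequency fallback

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current in PROMINENT_TAGS:
            self.prominent.append(text)
        else:
            # Crude filter: skip very short words.
            self.words.extend(w.lower() for w in text.split() if len(w) > 3)


def extract_topic(html):
    parser = TopicParser()
    parser.feed(html)
    parser.close()
    if parser.prominent:
        # Step 1: a specific tag was found, so use its text as the topic.
        return parser.prominent[0]
    if parser.words:
        # Step 2: no special formatting, so fall back to the most frequent word.
        return Counter(parser.words).most_common(1)[0][0]
    return None


tagged = extract_topic("<html><body><h1>Haiti earthquake</h1><p>Relief news</p></body></html>")
untagged = extract_topic("<p>storm warning storm alert</p><p>storm update</p>")
```

A real vision-based extractor would also look at font size, position and color of each block; those signals are not modeled here.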

Topic emerging time (Adrien Bougouin and Florian Boudin)

(Mario Cataldi, Luigi Di Caro and Claudio Schifanella)

We want to know how the topics on the Internet change

• Extracting the contents according to a novel aging theory

• Analyzing the social relationships in the network with the well-known PageRank algorithm in order to determine the authority of the users.

• Finally, we leverage a navigable topic graph, allowing the detection of the emerging topics, under user-specified time constraints.
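
As a reminder of how PageRank can assign authority to users, the sketch below runs a plain power iteration over a small "who follows whom" graph. The damping factor of 0.85, the fixed iteration count, the function name `pagerank` and the sample users are assumptions for illustration; the cited work may configure the algorithm differently.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each user to the list of users they follow, so authority
    flows from a follower to the accounts being followed.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A user who follows nobody spreads rank evenly over all users.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical follower graph: alice and bob follow carol, carol follows alice.
follows = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = pagerank(follows)
```

In this toy graph carol ends up with the highest authority and bob, whom nobody follows, with the lowest. Such scores can then weight each user's contribution to a keyword's usage.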

Novel aging theory (1)

• Many conventional clustering and classification strategies cannot be applied to this problem because they tend to ignore the temporal relationships among documents (tweets in our case) related to a news event.

• We can evaluate the usage of a keyword by its energy, which indicates the vitality status of the keyword and can qualify the keyword’s usage. In fact, a high energy value implies that the term is becoming important in the considered community, while a low energy value implies that it is falling out of favor.
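
One simple way to turn the energy idea into a number is to compare a keyword's usage in the current time interval with its usage in a few preceding intervals, giving more weight to the recent past. The sketch below does this with raw counts. It is a simplification written for this tutorial, not the exact definition from the cited work (which, among other things, weights usage by author authority); the function name `energy`, the `window` parameter and the sample counts are assumptions.

```python
def energy(counts, window=3):
    """Energy of a keyword in the latest interval.

    counts holds per-interval usage counts, oldest first. The latest
    count is compared with each of the previous `window` counts, and
    differences against older intervals are down-weighted by distance.
    """
    current = counts[-1]
    past = counts[-1 - window:-1]
    total = 0.0
    # distance 1 is the interval immediately before the current one.
    for distance, previous in enumerate(reversed(past), start=1):
        total += (current ** 2 - previous ** 2) / distance
    return total


# Hypothetical usage counts over four consecutive intervals.
rising = energy([2, 3, 2, 10])   # sudden burst in the latest interval
fading = energy([10, 9, 8, 2])   # usage is dropping
```

A positive value marks a keyword whose usage is growing relative to its recent history, and a negative value marks one that is fading.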

Novel aging theory (2)

Fig 4 Statistical usage of the term “earthquake” in Twitter from October 2009 to January 2010; the peak represents the earthquake that occurred in Haiti on 12 January 2010.

Fig 5 A topic graph with two Strongly Connected Components (in red and blue) representing two different emerging topics: labels in bold represent emerging keywords, while the thickness of an edge represents the semantic relationship between the considered keywords.

After extracting topics

• Recommendation system (Brendan O'Connor, Michel Krieger and David Ahn)

• Topic evolution (Yookyung Jo, John E. Hopcroft, Carl Lagoze)

An example (1) (Mathieu Bastian, Matthew Hayes, William Vaughan et al.)

“Skills and Expertise” is a data-driven feature on LinkedIn, the world’s largest professional online social network, which allows members to tag themselves with topics representing their areas of expertise.

An example (2)

• Folksonomy creation

Entity extraction

Clustering to provide context

• Skills Inference and Recommendation

Naive Bayes classifier to estimate the likelihood that a user has a skill
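
To show what a Naive Bayes skill classifier might look like in miniature, the sketch below estimates the probability that a user has one skill from the tokens in their profile, using Laplace smoothing. The features, the training profiles, the "Hadoop" skill and the function names `train` and `prob_has_skill` are all hypothetical; the production system described in the cited paper is not reproduced here.

```python
import math
from collections import Counter


def train(profiles):
    """profiles is a list of (tokens, has_skill) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for tokens, has_skill in profiles:
        class_counts[has_skill] += 1
        word_counts[has_skill].update(tokens)
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab


def prob_has_skill(tokens, model):
    """Posterior probability that a profile with these tokens has the skill."""
    word_counts, class_counts, vocab = model
    n_profiles = sum(class_counts.values())
    log_scores = {}
    for label in (True, False):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_profiles)
        for token in tokens:
            if token in vocab:  # ignore tokens never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])


# Hypothetical training profiles for the skill "Hadoop".
profiles = [
    (["hadoop", "java", "engineer"], True),
    (["mapreduce", "hadoop", "data"], True),
    (["sales", "marketing", "manager"], False),
    (["marketing", "brand", "design"], False),
]
model = train(profiles)
likely = prob_has_skill(["hadoop", "data"], model)
unlikely = prob_has_skill(["marketing"], model)
```

Skills whose probability exceeds a chosen threshold could then be recommended to the user. Note that this sketch expects both classes to appear in the training data.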

Practical design

• Large data scale

Hadoop (MapReduce) framework

Dimensionality reduction (PCA, ICA)

• Topic extraction

Unsupervised training (EM, DNN, etc.)
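
For readers unfamiliar with the MapReduce model, the following in-process sketch counts words across messages using separate map, shuffle and reduce steps. It only imitates the programming model in a single Python process and does not use the Hadoop API; the function names and sample messages are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(message):
    """Emit a (word, 1) pair for every word in one message."""
    return [(word.lower(), 1) for word in message.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def word_count(messages):
    pairs = chain.from_iterable(map_phase(m) for m in messages)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample messages.
counts = word_count(["haiti quake", "Haiti relief"])
```

Because each message is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes the model attractive at this data scale.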

Conclusion

• It is necessary to extract summaries of messages on the Internet

• Online messages pose some difficulties but also give us some extra clues (vision)

• Some applications (LinkedIn skills inference system)

Further work (1)

• It is a popular field right now due to the popularity of the Internet

• How to handle online messages is also an open field

Highly unstructured data

Short but meaningful messages

• How to make implementations that cluster online messages faster

Further work (2)

• Mining the relationships between topics, topic evolution thread discovery and textual mining on evolution threads.

• Building a navigational application from the graph's concrete information, i.e. through “edges” between topics in the graph model.

References

• Papers used in this tutorial are in this file

https://subversion.ews.illinois.edu/svn/sp16-cs410/xkong12/progress.pdf
