27
LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox April 30, 2015

LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

LDA: Extracting Topics from Tweets and Webpages

Sarunya Pumma Xiaoyang Liu

CS5604 Information Retrieval Instructor: Dr. Edward Fox

April 30, 2015

Page 2: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

OutlineIntroduction

The IDEAL System LDA

Design and Implementation Data Preparation Topic Extraction Data Storing

Evaluation Human Judgement Evaluation Against the Clustering Team

Conclusion

Page 3: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Introduction

Page 4: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

The IDEAL Project__________________

_

URLs

Nutch Noise

Classification

Clustering

LDA

NER

IndexingHBase

Solr

__________________

_

Tweets*

__________________

_

Webpages

*

__________________

_

Cleaned Tweets

*

__________________

_

Cleaned Webpages

*Inter-dependence between teams

4

Page 5: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

What is LDA?

Latent Dirichlet Allocation (LDA) is a generative topic model

Each document is a mixture of topics

Each word is attributable to one of the document’s topics

5

Page 6: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Why is LDA Useful?Extract topics of documents and words associated with topics Find semantical similarity between documents Unsupervised method: training model is not needed Dimensionality reduction: lower-dimensional representation for documents in corpus

6

Page 7: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

How Does LDA Work?

Each document w in a corpus D: 1. Choose N ∼ Poisson(ξ). 2. Choose θ ∼ Dir(α). 3. For each of the N words wn:

Choose a topic zn ∼ Multinomial(θ). Choose a word wn from p(wn|zn,β), a multinomial probability conditioned on the topic.

7

Page 8: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

How Does LDA Work?Example:

Sentence 1: Monkeys like banana Sentence 2: Cats like fish Sentence 3: Banana and apple are popular fruits

sentence 1 contains 50% topic 1 and 50% topic 2 sentence 2 contains 100% topic 2 sentence 3 contains 100% topic 1

Topic 1 (plant): banana apple orange grape … Topic 2 (animal): cat fish tiger lion …

8

Page 9: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Design & Implementation

Page 10: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Architectural Design

Hadoop & HDFS

HBaseMahout

Data Preparation Data StoringTopic

ExtractionDoc-Topic

DistributionTweets/

Webpages (AVRO)

Sequence FileDoc-Topic

Distribution (AVRO)

10

Page 11: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(1) Data PreparationTweets

Use the cleaned data Use the Java program to convert AVRO file to sequence file

Webpages Crawl the webpages based on the URLs Use Mahout to convert webpages to sequence file

11

Page 12: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Architectural Design

Hadoop & HDFS

HBaseMahout

Data Preparation Data StoringTopic

ExtractionDoc-Topic

DistributionTweets/

Webpages (AVRO)

Sequence FileDoc-Topic

Distribution (AVRO)

12

Page 13: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(2) Topic ExtractionImplementation

Java using Mahout library (CVB - Collapsed Variational Bayesian Inference)

Steps 1. Convert the sequence file to a sparse vector

based on TF-IDF 2. Decompose the vector to singular value

decomposition vectors (SVD) 3. Run CVB algorithm on SVD vectors

13

Page 14: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(2) Topic Extraction

Output Document-topic distribution

Structure (N topics) {topic1:P1(topic1),topic2:P1(topic2),…,topicN:P1(topicN)}

.

.

. {topic1:PM(topic1),topic2:PM(topic2),…,topicN:PM(topicN)}} M

documents

14

Page 15: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(2) Topic Extraction

How many topics for each collection? Empirical study!

How to evaluate the quality of LDA?

Kullback Leibler (KL) Divergence

15

Page 16: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(2) Topic ExtractionLDA Evaluation 1. Group the documents

based on X top topics

2. Compute the average KL divergence in the cluster

3. Compute the average KL of all the clusters

C2

C1

C3

KL2

KL1

KL3

KLLDA

16

Page 17: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(2) Topic ExtractionSmall collection

KL Di

verg

ence

0

0.1

0.2

0.3

0.4

Number of Topics3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

3 Top-Topics 4 Top-Topics 5 Top-Topics

17

Page 18: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(2) Topic ExtractionLarge collection (Top topics = 3)

KL Di

verg

ence

0

0.065

0.13

0.195

0.26

Number of topics3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48

18

Page 19: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Architectural Design

Hadoop & HDFS

HBaseMahout

Data Preparation Data StoringTopic

ExtractionDoc-Topic

DistributionTweets/

Webpages (AVRO)

Sequence FileDoc-Topic

Distribution (AVRO)

19

Page 20: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

(3) Data Storing

On-going process We have talked to the Hadoop team

The output will be converted to AVRO file It will be loaded into HBase using the script provided by the Hadoop team

20

Page 21: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Evaluation

Page 22: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Word intrusionw (m,k) is the intruding word among the words of k th topic generated by mth model.

i(m,k,s) is the intruder selected by subject s on the testing set.

S is the number of subjects

find the word which doesn’t belong with the others

{dog, cat, horse, apple, pig , cow}

Human Judgement

22

Page 23: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Topic intrusionθ is the possibility of the topic in the documents

Index * means it is the true intruder

index s, j donate to the intruded topic selected by subjects

Finding the topics which don’t consistent with the others

Higher the value of TLO and MP, greater the correspondence

Human Judgement

23

Page 24: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Evaluation Against the Clustering Team

On-going process Measurement: Cosine Similarity Comparison: T-Test Hypotheses

H0: μLDA - μClustering > 0 H1: μLDA - μClustering <= 0

24

Page 25: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

Conclusion

Page 26: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

ConclusionWe have applied LDA on tweets and webpages

The number of topics for each collection is determined by the empirical study

Topics quality will be evaluated by human judgement and cross validation with the Hadoop team

The results from the remaining works will be presented in the final report

26

Page 27: LDA: Extracting Topics from Tweets and Webpages...LDA: Extracting Topics from Tweets and Webpages Sarunya Pumma Xiaoyang Liu CS5604 Information Retrieval Instructor: Dr. Edward Fox

-LDA team

:)