Transcript
Page 1: Survey – Topical Phrases and Exploration of Text Corpora by Topics

Topical Phrases and Exploration of Text Corpora by Topics

CS 512 – Spring 2014
Professor: Jiawei Han

University of Illinois at Urbana-Champaign
Department of Computer Science

Ahmed El-Kishky
Mehrdad Biglari

Ramesh Kolavennu

Page 2

Introduction

• This survey accompanies a system for text exploration based on the ToPMine framework: "Scalable Topical Phrase Mining from Large Text Corpora," El-Kishky, Song, Wang, Voss, Han

Page 3

Outline

• Unigram Topic Modeling
  – Latent Dirichlet Allocation
• Post Topic Model Topical Phrase Construction
  – Turbo Topics
  – Keyphrase Extraction and Ranking by Topic (KERT)
  – Constructing Topical Hierarchies
• Inferring Phrases and Topics
  – Bigram Topic Model
  – Topical N-grams
  – Phrase-Discovering LDA
• Separation of Phrase and Topic
  – ToPMine
• Constraining LDA
  – Hidden Topic Markov Model
  – Sentence LDA
• Frequent Pattern Mining and Topic Models
  – Sentence LDA
• Phrase Extraction Literature
  – Common Phrase Extraction Techniques
• Visualization
  – Visualizing Topic Models

Page 4

Latent Dirichlet Allocation

LDA is a generative topic model that posits that each document is a mixture of a small number of topics, with each topic a multinomial distribution over words.

1. Sample a distribution over topics from a Dirichlet prior.
2. For each word in the document:
   – Randomly choose a topic from the distribution over topics in step #1.
   – Randomly choose a word from that topic's distribution over the vocabulary.
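The generative process above can be sketched in a few lines of NumPy; the vocabulary, topic count, and hyperparameter values here are illustrative assumptions, not taken from the survey.

```python
# A minimal sketch of LDA's generative process.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "mining", "neural", "network", "query", "index"]
K = 2                                  # number of topics (assumed)
alpha, beta = 0.5, 0.1                 # Dirichlet hyperparameters (assumed)

# Each topic is a multinomial distribution over the vocabulary.
topic_word = rng.dirichlet([beta] * len(vocab), size=K)

def generate_document(n_words):
    # Step 1: sample this document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                    # choose a topic
        w = rng.choice(len(vocab), p=topic_word[z])   # choose a word from it
        words.append(vocab[w])
    return words

doc = generate_document(8)
print(doc)
```

In practice one observes only the documents and infers the hidden topic assignments, e.g. via Gibbs sampling or variational inference.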

Page 5

LDA Sample Topics

Page 6

Post Topic Modeling Phrase Construction

• Turbo Topics
• Keyphrase Extraction and Ranking by Topics (KERT)
• Constructing Topical Hierarchies (CATHY)

Page 7

Turbo Topics

Construction of topical phrases as a post-processing step to LDA:

1. Perform LDA topic modeling on the input corpus.
2. Find consecutive words that are of the same topic within the corpus.
3. Perform recursive permutation tests on a back-off model to test the significance of phrases.
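Step 2 above can be illustrated with a toy scan over a topic-labeled corpus (this is an illustrative sketch, not the paper's implementation): runs of consecutive words assigned to the same topic become candidate phrases.

```python
# Find runs of length >= 2 where consecutive tokens share a topic label.
def candidate_phrases(tokens, topics):
    """tokens and topics are parallel lists; yield same-topic runs."""
    run = [tokens[0]]
    for i in range(1, len(tokens)):
        if topics[i] == topics[i - 1]:
            run.append(tokens[i])
        else:
            if len(run) >= 2:
                yield tuple(run)
            run = [tokens[i]]
    if len(run) >= 2:
        yield tuple(run)

tokens = ["support", "vector", "machine", "improves", "text", "classification"]
topics = [1, 1, 1, 0, 2, 2]
print(list(candidate_phrases(tokens, topics)))
# -> [('support', 'vector', 'machine'), ('text', 'classification')]
```

Turbo Topics then keeps only those candidates that pass its permutation significance test.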

Page 8

KERT

Construction of topical phrases as a post-processing step to LDA:

1. Perform LDA topic modeling on the input corpus.
2. Partition each document into K documents (K = #topics).
3. For each topic, mine frequent patterns on the new corpus.
4. Rank the output for each topic based on four heuristic measures:
   • Purity – measure of how often a phrase appears in a topic in relation to other topics
   • Completeness – threshold filtering of subphrases (support vector vs. support vector machines)
   • Coverage – measure of the frequency of a phrase in a topic
   • Phrase-ness – ratio of the expected occurrence of a phrase to its actual occurrence

(Example phrases from one topic: learning; support vector machines; reinforcement learning; feature selection; conditional random fields; classification; decision trees; …)
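The intuition behind two of the ranking heuristics can be shown with toy counts; the exact formulas in the KERT paper differ, and all numbers below are made up for illustration.

```python
# Toy phrase counts: phrase -> occurrences per topic (assumed values).
phrase_counts = {
    "support vector machines": {"ML": 90, "DB": 2},
    "query processing":        {"ML": 1,  "DB": 80},
}
docs_per_topic = {"ML": 1000, "DB": 1000}   # corpus sizes (assumed)

def coverage(phrase, topic):
    # How frequently the phrase appears within the topic's documents.
    return phrase_counts[phrase][topic] / docs_per_topic[topic]

def purity(phrase, topic):
    # Fraction of the phrase's total occurrences that fall in this topic.
    total = sum(phrase_counts[phrase].values())
    return phrase_counts[phrase][topic] / total

print(coverage("support vector machines", "ML"))  # 0.09
print(purity("support vector machines", "ML"))    # ~0.978
```

A phrase like "support vector machines" scores high on both measures for the ML topic, which is why it ranks near the top of that topic's list.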

Page 9

KERT: Sample Results

Page 10

CATHY

1. Create a term co-occurrence network.
2. Cluster the link weights into "topics".
3. To find subtopics, recursively cluster the subgraphs.

(Example hierarchy: DBLP → Data & Information Systems → DB, DM, IR, AI, …)
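Step 1 can be sketched with simple document-level pair counting (an illustrative assumption; CATHY's actual network construction differs in detail):

```python
# Build a term co-occurrence network: edge weight = number of
# documents in which the two terms co-occur.
from collections import Counter
from itertools import combinations

docs = [
    ["database", "query", "index"],
    ["query", "index", "optimization"],
    ["neural", "network", "training"],
]

edges = Counter()
for doc in docs:
    for u, v in combinations(sorted(set(doc)), 2):
        edges[(u, v)] += 1

print(edges[("index", "query")])  # 2
```

Clustering the edges of this weighted graph, rather than the raw documents, is what lets CATHY recurse on subgraphs to build a topic hierarchy.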

Page 11

CATHY: Sample Results

Page 12

Inferring Phrases and Topics

• Bigram Topic Model
• Topical N-grams
• Phrase-Discovering LDA

Page 13

Bigram Topic Model

1. Draw a discrete word distribution from a Dirichlet prior for each topic/previous-word pair.

2. For each document, draw a discrete topic distribution from a Dirichlet prior. Then for each word:
   – Draw a topic from the document-topic distribution in step #2.
   – Draw a word from the word distribution in step #1 corresponding to the drawn topic, conditioned on the previous word.

Page 14

Topical N-Grams

1. Draw a discrete word distribution from a Dirichlet prior for each topic.
2. Draw Bernoulli distributions from a Beta prior for each topic and each word.
3. Draw discrete distributions from a Dirichlet prior for each topic and each word.
4. For each document, draw a discrete document-topic distribution from a Dirichlet prior. Then for each word in the document:
   – Draw a value, x, from the Bernoulli distribution conditioned on the previous word and its topic.
   – Draw a topic, t, from the document-topic distribution.
   – Draw a word from the previous word's bigram word-topic distribution if x = 1; otherwise draw from the unigram topic distribution for topic t.
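The per-word draw in step 4 can be sketched as follows; all distributions and parameter values here are made-up toy inputs, standing in for the quantities TNG would infer.

```python
# Toy per-word draw for TNG: a Bernoulli switch x decides whether the
# word continues a phrase (bigram distribution conditioned on the
# previous word) or is drawn fresh from the unigram topic distribution.
import random
random.seed(0)

unigram = {0: {"learning": 0.6, "machine": 0.4}}   # topic -> word probs (assumed)
bigram = {("machine", 0): {"learning": 1.0}}       # (prev word, topic) -> probs (assumed)
p_continue = {("machine", 0): 0.9}                 # Bernoulli parameter (assumed)

def draw_word(prev_word, topic):
    x = random.random() < p_continue.get((prev_word, topic), 0.0)
    dist = bigram[(prev_word, topic)] if x else unigram[topic]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs)[0], x

word, continued = draw_word("machine", 0)
```

After "machine" in topic 0, the switch usually fires and the model emits "learning", forming the phrase "machine learning".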

Page 15

TNG: Sample Results

Page 16

Phrase-Discovering LDA

• Each sentence is viewed as a time-series of words.
• PD-LDA posits that the generative parameter (the topic) changes periodically in accordance with changepoint indicators. As such, topical phrases can be arbitrarily long but are always drawn from a single topic.

The process is as follows:

1. Each token is drawn from a distribution conditioned on the context.
2. The context consists of the phrase topic and the past m words. When m = 1, this is analogous to TNG, but for larger contexts, the distributions are Pitman-Yor processes linked together in a hierarchical tree structure.

Page 17

PD-LDA Results

Page 18

Separating Phrases and Topics: ToPMine

• ToPMine addresses the computational and quality issues of other topical phrase extraction methods by separating phrase finding from topic modeling.

• First the corpus is transformed from a bag-of-words to a bag-of-phrases via an efficient agglomerative grouping algorithm.

• The resultant bag-of-phrases is then the input to PhraseLDA, a phrase-based topic model.
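The agglomerative grouping in the first step can be sketched as repeated merging of the best-scoring adjacent token pair; the raw pair frequency used as the score here is a simplified stand-in for ToPMine's statistical significance measure.

```python
# Simplified agglomerative phrase construction: merge the most frequent
# adjacent token pair until no pair clears the threshold.
from collections import Counter

def mine_phrases(docs, min_count=2):
    docs = [list(d) for d in docs]
    while True:
        pairs = Counter()
        for d in docs:
            for a, b in zip(d, d[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        merged = a + "_" + b
        for d in docs:
            i = 0
            while i < len(d) - 1:
                if d[i] == a and d[i + 1] == b:
                    d[i:i + 2] = [merged]
                i += 1
    return docs

docs = [["support", "vector", "machine"],
        ["support", "vector", "regression"]]
print(mine_phrases(docs))
# -> [['support_vector', 'machine'], ['support_vector', 'regression']]
```

Because merging only ever counts adjacent pairs, the transformation stays linear-time per pass, which is what makes the separation from topic modeling scalable.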

Page 19

ToPMine

(Figure: Step 1 mines good phrases; Step 2 performs phrase-based topic modeling.)

Page 20

Constraining LDA

• Constraining LDA has been performed with positive results.

• Sentence LDA constrains each sentence to take on the same topic.

• This model is adapted to work on mined phrases, where each phrase shares a single topic (a weaker assumption than sentences).
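The phrase-level constraint can be contrasted with LDA's per-token draw in a short sketch; the phrases and topic mixture below are illustrative assumptions.

```python
# Under the phrase-level constraint, one topic is drawn per mined
# phrase and every token in that phrase inherits it (LDA would instead
# draw a topic independently for each token).
import numpy as np

rng = np.random.default_rng(1)
K = 2
theta = rng.dirichlet([0.5] * K)       # document's topic mixture (assumed)
phrases = [("support", "vector", "machine"), ("query",), ("neural", "network")]

assignments = []
for phrase in phrases:
    z = rng.choice(K, p=theta)         # one topic for the whole phrase
    assignments.extend((token, z) for token in phrase)
```

This is a weaker assumption than Sentence LDA's, since a phrase is a much shorter unit than a sentence, so forcing it onto a single topic distorts the model less.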

Page 21

Visualizing Topic Models

Page 22

Visualizing Topic Models

Page 23

Q&A
