Survey: Topical Phrases and Exploration of Text Corpora by Topics


  • 1. University of Illinois at Urbana-Champaign, Department of Computer Science. Topical Phrases and Exploration of Text Corpora by Topics. Ahmed El-Kishky, Mehrdad Biglari, Ramesh Kolavennu. CS 512, Spring 2014. Professor: Jiawei Han
  • 2. Introduction: This survey relates to a system for text exploration based on the ToPMine framework ("Scalable Topical Phrase Mining from Large Text Corpora," El-Kishky, Song, Wang, Voss, Han).
  • 3. Outline: Unigram Topic Modeling: Latent Dirichlet Allocation. Post Topic Model Topical Phrase Construction: Turbo Topics; Keyphrase Extraction and Ranking by Topic (KERT); Constructing Topical Hierarchies. Inferring Phrases and Topics: Bigram Language Model; Topical N-grams; Phrase-Discovering LDA. Separation of Phrase and Topic: ToPMine; Constraining LDA; Hidden Topic Markov Model; Sentence LDA; Frequent Pattern Mining and Topic Models. Phrase Extraction Literature: Common Phrase Extraction Techniques. Visualization: Visualizing Topic Models.
  • 4. Latent Dirichlet Allocation: LDA is a generative topic model that posits that each document is a mixture of a small number of topics, with each topic a multinomial distribution over words. 1. Sample a distribution over topics from a Dirichlet prior. 2. For each word in the document: randomly choose a topic from the distribution over topics in step 1, then randomly choose a word from the corresponding topic's distribution over the vocabulary.
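A minimal sketch of this generative process in Python; the toy vocabulary, number of topics, and Dirichlet hyperparameters below are illustrative assumptions, not values from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["learning", "support", "vector", "topic", "model", "phrase"]  # toy vocabulary (assumed)
K = 2                   # number of topics (assumed)
alpha, beta = 0.5, 0.1  # Dirichlet hyperparameters (assumed)

# Each topic is a multinomial distribution over the vocabulary.
topic_word = rng.dirichlet([beta] * len(vocab), size=K)

def generate_document(n_words=10):
    # 1. Sample this document's distribution over topics from a Dirichlet prior.
    doc_topics = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(n_words):
        # 2a. Randomly choose a topic from the document's topic distribution.
        z = rng.choice(K, p=doc_topics)
        # 2b. Randomly choose a word from that topic's distribution over the vocabulary.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return words

print(generate_document())
```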
  • 5. LDA Sample Topics
  • 6. Post Topic Modeling Phrase Construction: Turbo Topics; Keyphrase Extraction and Ranking by Topics (KERT); Constructing Topical Hierarchies (CATHY)
  • 7. Turbo Topics: construction of topical phrases as a post-processing step to LDA. 1. Perform LDA topic modeling on the input corpus. 2. Find consecutive words within the corpus that are assigned the same topic. 3. Perform recursive permutation tests on a back-off model to test the significance of candidate phrases.
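A rough sketch of step 2, collecting runs of adjacent tokens that received the same topic assignment; the tokens and topic labels are invented for illustration, and the permutation-test step is omitted:

```python
from itertools import groupby

def same_topic_runs(tokens, topic_assignments, min_len=2):
    """Collect candidate phrases: maximal runs of adjacent tokens sharing one LDA topic."""
    candidates = []
    for topic, group in groupby(zip(tokens, topic_assignments), key=lambda pair: pair[1]):
        run = [tok for tok, _ in group]
        if len(run) >= min_len:
            candidates.append((topic, " ".join(run)))
    return candidates

# Toy example; the topic assignments are assumed to come from an LDA pass.
tokens = ["support", "vector", "machines", "are", "popular"]
topics = [3, 3, 3, 1, 2]
print(same_topic_runs(tokens, topics))   # [(3, 'support vector machines')]
```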
  • 8. KERT: construction of topical phrases as a post-processing step to LDA. 1. Perform LDA topic modeling on the input corpus. 2. Partition each document into K documents (K = number of topics). 3. For each topic, mine frequent patterns on the new corpus. 4. Rank the output for each topic based on four heuristic measures: purity (how often a phrase appears in a topic relative to other topics); completeness (threshold filtering of subphrases, e.g. support vector vs. support vector machines); coverage (frequency of the phrase within the topic); phrase-ness (ratio of expected occurrence of a phrase to actual occurrence). Example ranked phrases for one topic: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, ...
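A hedged sketch of steps 2 through 4, with simple stand-ins for the purity and coverage heuristics (the exact KERT scoring formulas are not reproduced here; all names and parameters are assumptions):

```python
from collections import Counter, defaultdict

def kert_sketch(docs_with_topics, K, max_n=3, min_support=2):
    """docs_with_topics: list of [(token, topic), ...] per document, from an LDA run."""
    # Step 2: split each document into K pseudo-documents, one per topic.
    topic_docs = defaultdict(list)
    for doc in docs_with_topics:
        for k in range(K):
            topic_docs[k].append([tok for tok, z in doc if z == k])

    # Step 3: mine frequent contiguous n-grams within each topic's pseudo-corpus.
    topic_phrases = {}
    for k, pseudo_docs in topic_docs.items():
        counts = Counter()
        for tokens in pseudo_docs:
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i:i + n])] += 1
        topic_phrases[k] = {p: c for p, c in counts.items() if c >= min_support}

    # Step 4: rank by crude purity * coverage stand-ins (not the paper's formulas).
    ranked = {}
    for k, phrases in topic_phrases.items():
        total_k = sum(phrases.values()) or 1
        scores = {}
        for p, c in phrases.items():
            everywhere = sum(topic_phrases[j].get(p, 0) for j in topic_phrases)
            purity = c / everywhere     # how concentrated the phrase is in topic k
            coverage = c / total_k      # how frequent the phrase is within topic k
            scores[" ".join(p)] = purity * coverage
        ranked[k] = sorted(scores, key=scores.get, reverse=True)
    return ranked
```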
  • 9. KERT: Sample Results
  • 10. CATHY: 1. Create a term co-occurrence network. 2. Cluster the link weights into topics. 3. To find subtopics, recursively cluster the subgraphs. Illustrated on DBLP data, where a data & information systems topic splits into DB, DM, IR, and AI subtopics.
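A minimal sketch of step 1, building the term co-occurrence network with edge weights counting how many documents contain both terms; the toy documents are invented, and the clustering of link weights is not shown:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(docs):
    """Edge weight = number of documents in which the two terms co-occur."""
    edges = Counter()
    for tokens in docs:
        for u, v in combinations(sorted(set(tokens)), 2):
            edges[(u, v)] += 1
    return edges

docs = [["data", "mining", "frequent", "pattern"],
        ["information", "retrieval", "ranking"],
        ["data", "mining", "pattern", "ranking"]]
print(cooccurrence_network(docs)[("data", "mining")])   # 2
```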
  • 11. CATHY: Sample Results
  • 12. Inferring Phrases and Topics: Bigram Topic Model; Topical N-grams; Phrase-Discovering LDA
  • 13. Bigram Topic Model: 1. Draw a discrete word distribution from a Dirichlet prior for each topic/word pair. 2. For each document, draw a discrete topic distribution from a Dirichlet prior; then for each word: draw a topic from the document-topic distribution of step 2, and draw a word from the step-1 distribution selected by that topic and conditioned on the previous word.
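A simplified generative sketch of this model; the toy vocabulary and hyperparameters are assumptions, and the key point is that the word draw is conditioned on both the sampled topic and the previous word:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["support", "vector", "machine", "topic", "model"]   # toy vocabulary (assumed)
V, K = len(vocab), 2
alpha, delta = 0.5, 0.1                                      # hyperparameters (assumed)

# Step 1: a word distribution for every (topic, previous word) pair.
bigram_word = rng.dirichlet([delta] * V, size=(K, V))

def generate_document(n_words=8):
    # Step 2: a topic distribution for the document.
    doc_topics = rng.dirichlet([alpha] * K)
    prev = rng.integers(V)          # arbitrary start word index (assumption)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=doc_topics)              # draw a topic for this token
        w = rng.choice(V, p=bigram_word[z, prev])    # word given (topic, previous word)
        words.append(vocab[w])
        prev = w
    return words

print(generate_document())
```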
  • 14. Topical N-Grams: 1. Draw a discrete word distribution from a Dirichlet prior for each topic. 2. Draw Bernoulli distributions from a Beta prior for each topic and each word. 3. Draw discrete distributions from a Dirichlet prior for each topic and each word. 4. For each document, draw a discrete document-topic distribution from a Dirichlet prior; then for each word in the document: draw a value X from the Bernoulli distribution conditioned on the previous word and its topic; draw a topic T from the document-topic distribution; if X = 1, draw the word from the previous word's bigram word-topic distribution, else draw it from the unigram distribution of topic T.
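A simplified generative sketch of TNG along the same lines; vocabulary size, hyperparameters, and the handling of the first token are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["support", "vector", "machine", "learning", "model"]   # toy vocabulary (assumed)
V, K = len(vocab), 2
alpha, beta, delta, gamma = 0.5, 0.1, 0.1, 1.0                  # hyperparameters (assumed)

unigram_word = rng.dirichlet([beta] * V, size=K)        # step 1: per-topic unigram word dists
bigram_switch = rng.beta(gamma, gamma, size=(K, V))     # step 2: Bernoulli params per (topic, word)
bigram_word = rng.dirichlet([delta] * V, size=(K, V))   # step 3: per (topic, word) bigram dists

def generate_document(n_words=8):
    doc_topics = rng.dirichlet([alpha] * K)             # step 4: document-topic distribution
    prev_w, prev_z = rng.integers(V), rng.choice(K, p=doc_topics)   # arbitrary start (assumption)
    words = []
    for _ in range(n_words):
        x = rng.random() < bigram_switch[prev_z, prev_w]   # X: continue a phrase from the previous word?
        z = rng.choice(K, p=doc_topics)                    # T: topic for this token
        if x:   # X = 1: word from the previous word's bigram word-topic distribution
            w = rng.choice(V, p=bigram_word[z, prev_w])
        else:   # X = 0: word from the unigram distribution of topic T
            w = rng.choice(V, p=unigram_word[z])
        words.append(vocab[w])
        prev_w, prev_z = w, z
    return words

print(generate_document())
```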
  • 15. TNG: Sample Results
  • 16. Phrase-Discovering LDA: Each sentence is viewed as a time series of words. PD-LDA posits that the generative parameter (the topic) changes periodically in accordance with changepoint indicators; as such, topical phrases can be arbitrarily long but are always drawn from a single topic. The process is as follows: 1. Each token is drawn from a distribution conditioned on the context. 2. The context consists of the phrase topic and the past m words. When m = 1, this is analogous to TNG; for larger contexts, the distributions are Pitman-Yor processes linked together in a hierarchical tree structure.
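A heavily simplified sketch of the changepoint idea only: the hierarchical Pitman-Yor distributions are replaced with plain per-topic multinomials, and each token is drawn from the phrase topic alone rather than the full m-word context, so this illustrates the phrase structure, not the actual PD-LDA distributions:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["support", "vector", "machine", "topic", "model"]   # toy vocabulary (assumed)
V, K = len(vocab), 2
p_change = 0.4                      # stand-in for the changepoint prior (assumed)

topic_word = rng.dirichlet([0.1] * V, size=K)

def generate_sentence(n_tokens=10):
    doc_topics = rng.dirichlet([0.5] * K)
    phrases, current = [], []
    z = rng.choice(K, p=doc_topics)
    for _ in range(n_tokens):
        # A changepoint indicator decides whether a new phrase (and a new topic) starts here.
        if current and rng.random() < p_change:
            phrases.append((z, current))
            current = []
            z = rng.choice(K, p=doc_topics)
        # Every token in the current phrase is drawn from the same topic z,
        # so a phrase can be arbitrarily long but never mixes topics.
        current.append(vocab[rng.choice(V, p=topic_word[z])])
    phrases.append((z, current))
    return phrases

print(generate_sentence())
```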
  • 17. PD-LDA Results
  • 18. Separating Phrases and Topics: ToPMine. ToPMine addresses the computational and quality issues of other topical phrase extraction methods by separating phrase finding from topic modeling. First the corpus is transformed from a bag-of-words to a bag-of-phrases via an efficient agglomerative grouping algorithm. The resultant bag-of-phrases is then the input to PhraseLDA, a phrase-based topic model.
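A hedged sketch of the agglomerative grouping idea for one document: repeatedly merge the adjacent pair of units whose observed count most exceeds what independent occurrence would predict, stopping when no pair passes a threshold. The significance score, threshold, and input counts here are simplifications, not ToPMine's exact formulation:

```python
def bag_of_phrases(tokens, phrase_counts, total_tokens, threshold=2.0):
    """Greedy agglomerative merging of adjacent units into phrases for one document.

    phrase_counts maps a tuple of words to its corpus frequency (from a prior
    frequent contiguous pattern mining pass); total_tokens is the corpus size.
    """
    units = [(tok,) for tok in tokens]
    while len(units) > 1:
        best_score, best_i = threshold, None
        for i in range(len(units) - 1):
            a, b = units[i], units[i + 1]
            observed = phrase_counts.get(a + b, 0)
            expected = phrase_counts.get(a, 0) * phrase_counts.get(b, 0) / max(total_tokens, 1)
            # Simplified significance: how far the observed adjacency count sits
            # above the expectation under independence, in standard-deviation units.
            score = (observed - expected) / (expected ** 0.5 + 1e-9)
            if score > best_score:
                best_score, best_i = score, i
        if best_i is None:      # no adjacent pair is significant enough to merge
            break
        units[best_i:best_i + 2] = [units[best_i] + units[best_i + 1]]
    return [" ".join(u) for u in units]
```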
  • 19. ToPMine illustration: Step 1 (phrase mining) and Step 2 (phrase-constrained topic modeling) yield good phrases.
  • 20. Constraining LDA: Constraining LDA has been applied with positive results. Sentence LDA constrains all words in a sentence to take on the same topic. This model is adapted to work on mined phrases, where each phrase shares a single topic (a weaker constraint than whole sentences).
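A minimal sketch of the constraint these models share: one topic is sampled per phrase (or sentence) and assigned to all of its tokens, instead of one topic per token. The proportionality below is a simplified, un-collapsed stand-in for the actual PhraseLDA Gibbs update, and all names and priors are assumptions:

```python
import numpy as np

def sample_phrase_topic(phrase_tokens, doc_topic_counts, topic_word_counts,
                        alpha=0.5, beta=0.1, rng=None):
    """Draw a single topic shared by all tokens of a phrase (simplified form).

    phrase_tokens: word indices of the phrase; doc_topic_counts: shape (K,);
    topic_word_counts: shape (K, V). Both count arrays come from the rest of the sampler.
    """
    rng = rng or np.random.default_rng()
    K, V = topic_word_counts.shape
    log_p = np.log(doc_topic_counts + alpha)        # document-level preference for each topic
    for w in phrase_tokens:                         # every token in the phrase must fit the same topic
        log_p += np.log((topic_word_counts[:, w] + beta) /
                        (topic_word_counts.sum(axis=1) + V * beta))
    p = np.exp(log_p - log_p.max())
    return rng.choice(K, p=p / p.sum())
```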
  • 21. Visualizing Topic Models
  • 22. Visualizing Topic Models
  • 23. Q&A