Text Mining – Part 2
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Association
Keyword-Based Association Analysis
• Aims to discover sets of keywords that occur frequently together in the documents
• Relies on the usual techniques for mining association and correlation rules
• Each document is considered as a transaction of the form (see the sketch below)
{document id, {set of keywords}}
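A minimal sketch of this transactional view in plain Python, counting frequent keyword pairs across documents; the toy corpus, the tokenization, and the minimum-support threshold are illustrative assumptions, not part of the slides:

from itertools import combinations
from collections import Counter

# Toy corpus: each document becomes a transaction, i.e. {document id, {set of keywords}}.
docs = {
    "d1": "stanford university research dollars shares",
    "d2": "stanford university exchange shares",
    "d3": "dollars shares exchange market",
}
transactions = {doc_id: set(text.split()) for doc_id, text in docs.items()}

min_support = 2  # keep keyword pairs that appear in at least 2 documents

# Count co-occurring keyword pairs (the 2-itemset step of frequent itemset mining;
# a full association miner would iterate to larger itemsets and compute rule confidence).
pair_counts = Counter()
for keywords in transactions.values():
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent_pairs)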
Why Mining Associations?
• Association mining may discover sets of consecutive or closely located keywords, called terms or phrases
§ Compound (e.g., {Stanford, University})
§ Noncompound (e.g., {dollars, shares, exchange})
• Once the most frequent terms have been discovered, term-level mining can be applied more effectively than mining at the single-word level
• They are useful for improving the accuracy of many NLP tasks
§ POS tagging, parsing, entity recognition, acronym expansion
§ Grammar learning
• They are directly useful for many applications in text retrieval and mining
§ Text retrieval (e.g., use word associations to suggest a variation of a query)
§ Automatic construction of a topic map for browsing: words as nodes and associations as edges
§ Compare and summarize opinions (e.g., what words are most strongly associated with “battery” in positive and negative reviews about iPhone 6, respectively?)
Basic Word Relations: Paradigmatic vs. Syntagmatic
• Paradigmatic
§ A & B have a paradigmatic relation if they can be substituted for each other (i.e., A & B are in the same class)
§ For example, “cat” and “dog”, “Monday” and “Tuesday”
• Syntagmatic
§ A & B have a syntagmatic relation if they can be combined with each other (i.e., A & B are related semantically)
§ For example, “cat” and “sit”, “car” and “drive”
• These two basic and complementary relations can be generalized to describe relations between any items in a language
How to Mine Paradigmatic and Syntagmatic Relations?
• Paradigmatic
§ Represent each word by its context
§ Compute context similarity
§ Words with high context similarity likely have a paradigmatic relation
• Syntagmatic
§ Count how many times two words occur together in a context (e.g., sentence or paragraph)
§ Compare their co-occurrences with their individual occurrences
§ Words with high co-occurrence but relatively low individual occurrences likely have a syntagmatic relation
• Paradigmatically related words tend to have syntagmatic relations with the same words, thus we can join the discovery of the two relations
Mining Paradigmatic Relations
Sim(“cat”, “dog”) =
Sim(Left1(“cat”), Left1(“dog”)) +
Sim(Right1(“cat”), Right1(“dog”)) +
… +
Sim(Window8(“cat”), Window8(“dog”)) = ?

A high value of Sim(word1, word2) implies that word1 and word2 are paradigmatically related
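A minimal sketch of this idea in Python: represent each word by a bag-of-words vector of its neighboring words and compare the vectors with cosine similarity. The toy sentences, the single symmetric window (instead of the separate Left1/Right1/Window8 views on the slide), and all names are illustrative assumptions:

import math
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "my cat eats fish",
    "my dog eats meat",
]

def context_vector(word, sentences, window=2):
    """Bag-of-words context: counts of words appearing within `window` positions of `word`."""
    ctx = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        ctx[tokens[j]] += 1
    return ctx

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Words with similar contexts (high cosine) are candidates for a paradigmatic relation.
print(cosine(context_vector("cat", sentences), context_vector("dog", sentences)))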
Adapting BM25 to Paradigmatic Relation Mining
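The formula on this slide is not in the extracted text; a hedged sketch of how BM25 is commonly adapted here is to replace the raw context counts with BM25-transformed, IDF-weighted weights before computing the dot-product similarity. Below, d1 and d2 denote the context pseudo-documents of the two words, c(w_i, d) is the count of word w_i in context d, and k, b, avdl are the usual BM25 parameters:

x_i = \frac{(k+1)\, c(w_i, d_1)}{c(w_i, d_1) + k \left(1 - b + b\, \frac{|d_1|}{\mathrm{avdl}}\right)}
\qquad
\mathrm{Sim}(d_1, d_2) = \sum_i \mathrm{IDF}(w_i)\, x_i\, y_i

with y_i defined analogously from d_2.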
Mining Syntagmatic Relations
Predicting Words
• We want to predict whether the word W is present or absent in a text segment
• Some words are easier to predict than others
§ W = “the” (easy because frequent)
§ W = “unicorn” (easy because rare)
§ W = “meat” (difficult because neither frequent nor rare)
• Formal definition
§ The binary random variable Xw is 1 if w is present, 0 otherwise
§ p(Xw=1) + p(Xw=0) = 1
• The more random Xw is, the more difficult it is to predict
How can we measure the “randomness” of Xw?
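The slides answer this with entropy (the next slide refers to “the entropy of Xw”); its standard definition, which is not in the extracted text, is

H(X_w) = -\sum_{v \in \{0,1\}} p(X_w = v) \log_2 p(X_w = v)

which is 0 when w is always present or always absent, and 1 bit when p(X_w = 1) = 0.5.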
Conditional Entropy
• The entropy of Xw measures the randomness of predicting w
§ The lower the entropy, the easier the prediction
§ The higher the entropy, the more difficult the prediction
• Suppose now that “eat” is present in the segment and that we need to predict the presence of “meat”
• Questions
§ Does the presence of “eat” help predicting the presence of “meat”?
§ Does it reduce the uncertainty about “meat”?
§ What if “eat” is not present?
Complete Formula
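The formula itself did not survive extraction; the standard expansion of conditional entropy, which this slide presumably spelled out for Xmeat given Xeat, is

H(X_{meat} \mid X_{eat}) = \sum_{u \in \{0,1\}} p(X_{eat} = u)\, H(X_{meat} \mid X_{eat} = u)
= -\sum_{u \in \{0,1\}} p(X_{eat} = u) \sum_{v \in \{0,1\}} p(X_{meat} = v \mid X_{eat} = u) \log_2 p(X_{meat} = v \mid X_{eat} = u)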
In general, for any discrete random variables we have that H(X) ≥ H(X|Y)

What is the minimum possible value of H(X|Y)?

Which one is smaller: H(Xmeat|Xthe) or H(Xmeat|Xeat)?
Mining Syntagmatic Relations using Entropy
• For each word W1 (a sketch follows below)
§ For every other word W2, compute H(XW1|XW2)
§ Sort all the candidates in ascending order of H(XW1|XW2)
§ Take the top-ranked candidate words as words that have potential syntagmatic relations with W1
§ A threshold is needed for each W1
• However, while H(XW1|XW2) and H(XW1|XW3) are comparable, H(XW1|XW2) and H(XW3|XW2) are not!
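A minimal sketch of this procedure in Python, estimating H(XW1|XW2) from counts over text segments and ranking candidate words for W1 = “meat”; the toy segments and all names are illustrative assumptions:

import math

# Toy text segments (e.g., sentences), each reduced to its set of words.
segments = [
    {"he", "likes", "to", "eat", "meat"},
    {"she", "eats", "vegetables"},
    {"they", "eat", "meat", "every", "day"},
    {"the", "cat", "sat", "on", "the", "mat"},
]
N = len(segments)

def entropy(p):
    """Entropy in bits of a binary variable with P(X=1) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def conditional_entropy(w1, w2):
    """Estimate H(X_w1 | X_w2) from segment counts."""
    n_w2 = sum(1 for s in segments if w2 in s)
    n_both = sum(1 for s in segments if w1 in s and w2 in s)
    n_w1_only = sum(1 for s in segments if w1 in s and w2 not in s)
    h = 0.0
    if n_w2 > 0:
        h += (n_w2 / N) * entropy(n_both / n_w2)
    if n_w2 < N:
        h += ((N - n_w2) / N) * entropy(n_w1_only / (N - n_w2))
    return h

# Candidates for syntagmatic relations with "meat": lowest conditional entropy first.
vocab = set().union(*segments) - {"meat"}
ranking = sorted(vocab, key=lambda w2: conditional_entropy("meat", w2))
print(ranking[:5])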
Text Classification
Document classification
• Solve the problem of automatically labeling text documents on the basis of
§ Topic
§ Style
§ Purpose
• Usual classification techniques can be used to learn from a training set of manually labeled documents
• Which features? There can be thousands of keywords…
• Major approaches
§ Similarity-based
§ Dimensionality reduction
§ Naïve Bayes text classifiers
Similarity-based Text Classifiers
• Exploits Information Retrieval and the k-nearest-neighbor classifier (a small sketch follows below)
§ For a new document to classify, the k most similar documents in the training set are retrieved
§ Documents are classified on the basis of the class distribution among the k retrieved documents, using a majority vote or a weighted vote
§ Tuning k is very important to achieve good performance
• Limitations
§ Space overhead to store all the documents of the training set
§ Time overhead to retrieve the similar documents
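A minimal sketch of such a classifier with scikit-learn, combining TF-IDF vectors with a cosine-distance k-NN and a distance-weighted vote; the toy training documents and labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy labelled training set: documents with a topic label.
train_docs = [
    "the team won the championship game",
    "the stock market closed higher today",
    "the striker scored two goals",
    "investors worry about rising inflation",
]
train_labels = ["sport", "finance", "sport", "finance"]

# TF-IDF vectors + k-NN with cosine distance and a distance-weighted vote over the k neighbors.
clf = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance"),
)
clf.fit(train_docs, train_labels)

print(clf.predict(["the goalkeeper saved a penalty"]))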
Dimensionality Reduction for Text Classification
• As in the Vector Space Model, the goal is to reduce the number of features used to represent text
• Usual dimensionality reduction approaches in Information Retrieval are based on the distribution of keywords over the whole document database
• In text classification it is important to also consider the correlation between keywords and classes (a small sketch follows below)
§ Rare keywords have a high TF-IDF but might be uniformly distributed among classes
§ LSA and LPA do not take class distributions into account
• Usual classification techniques, such as SVMs and Naïve Bayes classifiers, can then be applied on the reduced feature space
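A minimal sketch of class-aware feature selection with scikit-learn, scoring keywords by their chi-squared correlation with the class labels and keeping only the top-scoring ones; the toy corpus and the value of k are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "the team won the game",
    "stocks fell sharply today",
    "a late goal decided the match",
    "the central bank raised rates",
]
labels = ["sport", "finance", "sport", "finance"]

# Bag-of-words counts, then keep the k keywords most correlated with the class labels.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)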
Naïve Bayes for Text Classification
• Document categories {C1, …, Cn}
• Document to classify D
• Probabilistic model:
P(Ci | D) = P(D | Ci) P(Ci) / P(D)
• We choose the class C* such that
C* = argmaxi P(Ci | D)
• Issues
§ Which features?
§ How to compute the probabilities?
Naïve Bayes for Text Classification
• Features can simply be defined as the words in the document
• Let ai be the keyword in position i of the document and wj a word in the vocabulary; the model needs the probabilities P(ai = wj | Ci)
• Example
§ H = {like, dislike}
§ D = “Our approach to representing arbitrary text documents is disturbingly simple”
Naïve Bayes for Text Classification
• Features can simply be defined as the words in the document
• Let ai be the keyword in position i of the document and wj a word in the vocabulary
• Assumptions
§ Keyword distributions are inter-independent
§ Keyword distributions are order-independent
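Under these two assumptions the class-conditional probability of a document factorizes over word positions, with every position sharing the same word distribution; the corresponding formula is not in the extracted text, but presumably amounts to

P(D \mid C_i) = \prod_{k=1}^{|D|} P(a_k = w_{j_k} \mid C_i)
\qquad
C^* = \arg\max_i \, P(C_i) \prod_{k=1}^{|D|} P(a_k = w_{j_k} \mid C_i)

where w_{j_k} is the vocabulary word occurring in position k of D.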
Naïve Bayes for Text Classification
• How to compute the probabilities?
§ Simply counting the occurrences may lead to wrong results when probabilities are small
• M-estimate approach adapted for text:
§ Nc is the whole number of word positions in documents of class C
§ Nc,k is the number of occurrences of wk in documents of class C
§ |Vocabulary| is the number of distinct words in the training set
§ Uniform priors are assumed
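The estimate itself is missing from the extracted text; given the quantities Nc, Nc,k, and |Vocabulary| defined above, the standard m-estimate with uniform priors (Laplace smoothing) is presumably

P(w_k \mid C) = \frac{N_{C,k} + 1}{N_C + |\mathrm{Vocabulary}|}

which never assigns probability zero to a word that does not occur in the training documents of class C.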
Naïve Bayes for Text Classification
• Final classification picks the class C* that maximizes the posterior probability, as in the model above
• Despite its simplicity, the Naïve Bayes classifier works very well in practice
• Applications
§ Newsgroup post classification
§ NewsWeeder (news recommender)
Clustering Text
Text Clustering
• It involves the same steps as other text mining algorithms to preprocess the text data into an adequate representation (see the sketch below)
§ Tokenizing and stemming each synopsis
§ Transforming the corpus into a vector space using TF-IDF
§ Calculating the cosine distance between documents as a measure of similarity
§ Clustering the documents using the similarity measure
• Everything works as with the usual data, except that text requires a rather long preprocessing step
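A minimal sketch of this pipeline with scikit-learn; the toy documents and the number of clusters are illustrative assumptions, and stemming is omitted for brevity:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "a thrilling space adventure among the stars",
    "an astronaut stranded on a distant planet",
    "a romantic comedy set in Paris",
    "two strangers fall in love in a small cafe",
]

# TF-IDF vectors are L2-normalized by default, so Euclidean distance between them
# is a monotone function of cosine distance; KMeans therefore groups by cosine similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)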
Weka DEMO
• Movie Review Data
§ http://www.cs.cornell.edu/people/pabo/movie-review-data/
• Load Text Data in Weka (CLI)
Weka DEMO (2)
• From text to word vectors: StringToWordVector
• Naïve Bayes Classifier
• Stoplist
§ https://code.google.com/p/stop-words/
• Attribute Selection