Text Mining – Part 2
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Association
Keyword-Based Association Analysis
• Aims to discover sets of keywords that occur frequently together in the documents
• Relies on the usual techniques for mining association and correlation rules
• Each document is considered as a transaction of the form (see the sketch below)
{document id, {set of keywords}}
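A minimal sketch of this transactional view in plain Python, counting frequent keyword pairs across documents; the toy corpus, the tokenization, and the minimum-support threshold are illustrative assumptions, not part of the slides:

from itertools import combinations
from collections import Counter

# Toy corpus: each document becomes a transaction, i.e. {document id, {set of keywords}}.
docs = {
    "d1": "stanford university research dollars shares",
    "d2": "stanford university exchange shares",
    "d3": "dollars shares exchange market",
}
transactions = {doc_id: set(text.split()) for doc_id, text in docs.items()}

min_support = 2  # keep keyword pairs that appear in at least 2 documents

# Count co-occurring keyword pairs (the 2-itemset step of frequent itemset mining;
# a full association miner would iterate to larger itemsets and compute rule confidence).
pair_counts = Counter()
for keywords in transactions.values():
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent_pairs)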
Why Mining Associations?
• Association mining may discover sets of consecutive or closely located keywords, called terms or phrases
§ Compound (e.g., {Stanford, University})
§ Noncompound (e.g., {dollars, shares, exchange})
• Once the most frequent terms have been discovered, term-level mining can be applied more effectively than mining at the single-word level
• They are useful for improving the accuracy of many NLP tasks
§ POS tagging, parsing, entity recognition, acronym expansion
§ Grammar learning
• They are directly useful for many applications in text retrieval and mining
§ Text retrieval (e.g., use word associations to suggest a variation of a query)
§ Automatic construction of a topic map for browsing: words as nodes and associations as edges
§ Compare and summarize opinions (e.g., what words are most strongly associated with “battery” in positive and negative reviews about iPhone 6, respectively?)
Basic Word Relations: Paradigmatic vs. Syntagmatic
• Paradigmatic
§ A & B have a paradigmatic relation if they can be substituted for each other (i.e., A & B are in the same class)
§ For example, “cat” and “dog”, “Monday” and “Tuesday”
• Syntagmatic
§ A & B have a syntagmatic relation if they can be combined with each other (i.e., A & B are related semantically)
§ For example, “cat” and “sit”, “car” and “drive”
• These two basic and complementary relations can be generalized to describe relations between any items in a language
How to Mine Paradigmatic and Syntagmatic Relations?
• Paradigmatic
§ Represent each word by its context
§ Compute context similarity
§ Words with high context similarity likely have a paradigmatic relation
• Syntagmatic
§ Count how many times two words occur together in a context (e.g., sentence or paragraph)
§ Compare their co-occurrences with their individual occurrences
§ Words with high co-occurrence but relatively low individual occurrences likely have a syntagmatic relation
• Paradigmatically related words tend to have syntagmatic relations with the same words, thus we can join the discovery of the two relations
Mining Paradigmatic Relations
Sim(“cat”, “dog”) =
Sim(Left1(“cat”), Left1(“dog”)) +
Sim(Right1(“cat”), Right1(“dog”)) +
… +
Sim(Window8(“cat”), Window8(“dog”)) = ?

A high value of Sim(word1, word2) implies that word1 and word2 are paradigmatically related
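A minimal sketch of this idea in Python: represent each word by a bag-of-words vector of its neighboring words and compare the vectors with cosine similarity. The toy sentences, the single symmetric window (instead of the separate Left1/Right1/Window8 views on the slide), and all names are illustrative assumptions:

import math
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "my cat eats fish",
    "my dog eats meat",
]

def context_vector(word, sentences, window=2):
    """Bag-of-words context: counts of words appearing within `window` positions of `word`."""
    ctx = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        ctx[tokens[j]] += 1
    return ctx

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Words with similar contexts (high cosine) are candidates for a paradigmatic relation.
print(cosine(context_vector("cat", sentences), context_vector("dog", sentences)))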
Adapting BM25 to Paradigmatic Relation Mining
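The formula on this slide is not in the extracted text; a hedged sketch of how BM25 is commonly adapted here is to replace the raw context counts with BM25-transformed, IDF-weighted weights before computing the dot-product similarity. Below, d1 and d2 denote the context pseudo-documents of the two words, c(w_i, d) is the count of word w_i in context d, and k, b, avdl are the usual BM25 parameters:

x_i = \frac{(k+1)\, c(w_i, d_1)}{c(w_i, d_1) + k \left(1 - b + b\, \frac{|d_1|}{\mathrm{avdl}}\right)}
\qquad
\mathrm{Sim}(d_1, d_2) = \sum_i \mathrm{IDF}(w_i)\, x_i\, y_i

with y_i defined analogously from d_2.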
Mining Syntagmatic Relations
Predicting Words
• We want to predict whether the word W is present or absent in a text segment
• Some words are easier to predict than others
§ W = “the” (easy because frequent)
§ W = “unicorn” (easy because rare)
§ W = “meat” (difficult because neither frequent nor rare)
• Formal definition
§ The binary random variable Xw is 1 if w is present, 0 otherwise
§ p(Xw=1) + p(Xw=0) = 1
• The more random Xw is, the more difficult it is to predict
How can we measure the “randomness” of Xw?
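The slides answer this with entropy (the next slide refers to “the entropy of Xw”); its standard definition, which is not in the extracted text, is

H(X_w) = -\sum_{v \in \{0,1\}} p(X_w = v) \log_2 p(X_w = v)

which is 0 when w is always present or always absent, and 1 bit when p(X_w = 1) = 0.5.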
Conditional Entropy
• The entropy of Xw measures the randomness of predicting w
§ The lower the entropy, the easier the prediction
§ The higher the entropy, the more difficult the prediction
• Suppose now that “eat” is present in the segment and that we need to predict the presence of “meat”
• Questions
§ Does the presence of “eat” help predicting the presence of “meat”?
§ Does it reduce the uncertainty about “meat”?
§ What if “eat” is not present?
Complete Formula
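The formula itself did not survive extraction; the standard expansion of conditional entropy, which this slide presumably spelled out for Xmeat given Xeat, is

H(X_{meat} \mid X_{eat}) = \sum_{u \in \{0,1\}} p(X_{eat} = u)\, H(X_{meat} \mid X_{eat} = u)
= -\sum_{u \in \{0,1\}} p(X_{eat} = u) \sum_{v \in \{0,1\}} p(X_{meat} = v \mid X_{eat} = u) \log_2 p(X_{meat} = v \mid X_{eat} = u)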
In general, for any discrete random variables we have that H(X) ≥ H(X|Y)

What is the minimum possible value of H(X|Y)?

Which one is smaller: H(Xmeat|Xthe) or H(Xmeat|Xeat)?
Mining Syntagmatic Relations using Entropy
• For each word W1 (a sketch follows below)
§ For every other word W2, compute H(XW1|XW2)
§ Sort all the candidates in ascending order of H(XW1|XW2)
§ Take the top-ranked candidate words as words that have potential syntagmatic relations with W1
§ A threshold is needed for each W1
• However, while H(XW1|XW2) and H(XW1|XW3) are comparable, H(XW1|XW2) and H(XW3|XW2) are not!
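A minimal sketch of this procedure in Python, estimating H(XW1|XW2) from counts over text segments and ranking candidate words for W1 = “meat”; the toy segments and all names are illustrative assumptions:

import math

# Toy text segments (e.g., sentences), each reduced to its set of words.
segments = [
    {"he", "likes", "to", "eat", "meat"},
    {"she", "eats", "vegetables"},
    {"they", "eat", "meat", "every", "day"},
    {"the", "cat", "sat", "on", "the", "mat"},
]
N = len(segments)

def entropy(p):
    """Entropy in bits of a binary variable with P(X=1) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def conditional_entropy(w1, w2):
    """Estimate H(X_w1 | X_w2) from segment counts."""
    n_w2 = sum(1 for s in segments if w2 in s)
    n_both = sum(1 for s in segments if w1 in s and w2 in s)
    n_w1_only = sum(1 for s in segments if w1 in s and w2 not in s)
    h = 0.0
    if n_w2 > 0:
        h += (n_w2 / N) * entropy(n_both / n_w2)
    if n_w2 < N:
        h += ((N - n_w2) / N) * entropy(n_w1_only / (N - n_w2))
    return h

# Candidates for syntagmatic relations with "meat": lowest conditional entropy first.
vocab = set().union(*segments) - {"meat"}
ranking = sorted(vocab, key=lambda w2: conditional_entropy("meat", w2))
print(ranking[:5])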
Text Classification
Document classification
• Solve the problem of automatically labeling text documents on the basis of
§ Topic
§ Style
§ Purpose
• Usual classification techniques can be used to learn from a training set of manually labeled documents
• Which features? There can be thousands of keywords…
• Major approaches
§ Similarity-based
§ Dimensionality reduction
§ Naïve Bayes text classifiers
Similarity-based Text Classifiers
• Exploits Information Retrieval and the k-nearest-neighbor classifier (a small sketch follows below)
§ For a new document to classify, the k most similar documents in the training set are retrieved
§ Documents are classified on the basis of the class distribution among the k retrieved documents, using a majority vote or a weighted vote
§ Tuning k is very important to achieve good performance
• Limitations
§ Space overhead to store all the documents of the training set
§ Time overhead to retrieve the similar documents
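A minimal sketch of such a classifier with scikit-learn, combining TF-IDF vectors with a cosine-distance k-NN and a distance-weighted vote; the toy training documents and labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy labelled training set: documents with a topic label.
train_docs = [
    "the team won the championship game",
    "the stock market closed higher today",
    "the striker scored two goals",
    "investors worry about rising inflation",
]
train_labels = ["sport", "finance", "sport", "finance"]

# TF-IDF vectors + k-NN with cosine distance and a distance-weighted vote over the k neighbors.
clf = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance"),
)
clf.fit(train_docs, train_labels)

print(clf.predict(["the goalkeeper saved a penalty"]))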
Dimensionality Reduction for Text Classification
• As in the Vector Space Model, the goal is to reduce the number of features used to represent text
• Usual dimensionality reduction approaches in Information Retrieval are based on the distribution of keywords over the whole document database
• In text classification it is important to also consider the correlation between keywords and classes (a small sketch follows below)
§ Rare keywords have a high TF-IDF but might be uniformly distributed among classes
§ LSA and LPA do not take class distributions into account
• Usual classification techniques, such as SVMs and Naïve Bayes classifiers, can then be applied on the reduced feature space
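A minimal sketch of class-aware feature selection with scikit-learn, scoring keywords by their chi-squared correlation with the class labels and keeping only the top-scoring ones; the toy corpus and the value of k are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "the team won the game",
    "stocks fell sharply today",
    "a late goal decided the match",
    "the central bank raised rates",
]
labels = ["sport", "finance", "sport", "finance"]

# Bag-of-words counts, then keep the k keywords most correlated with the class labels.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)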
Naïve Bayes for Text Classification
• Document categories {C1, …, Cn}
• Document to classify D
• Probabilistic model:
P(Ci | D) = P(D | Ci) P(Ci) / P(D)
• We choose the class C* such that
C* = argmaxi P(Ci | D)
• Issues
§ Which features?
§ How to compute the probabilities?
Naïve Bayes for Text Classification
• Features can simply be defined as the words in the document
• Let ai be the keyword in position i of the document and wj a word in the vocabulary; the model needs the probabilities P(ai = wj | Ci)
• Example
§ H = {like, dislike}
§ D = “Our approach to representing arbitrary text documents is disturbingly simple”
Naïve Bayes for Text Classification
• Features can simply be defined as the words in the document
• Let ai be the keyword in position i of the document and wj a word in the vocabulary
• Assumptions
§ Keyword distributions are inter-independent
§ Keyword distributions are order-independent
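Under these two assumptions the class-conditional probability of a document factorizes over word positions, with every position sharing the same word distribution; the corresponding formula is not in the extracted text, but presumably amounts to

P(D \mid C_i) = \prod_{k=1}^{|D|} P(a_k = w_{j_k} \mid C_i)
\qquad
C^* = \arg\max_i \, P(C_i) \prod_{k=1}^{|D|} P(a_k = w_{j_k} \mid C_i)

where w_{j_k} is the vocabulary word occurring in position k of D.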
Naïve Bayes for Text Classification
• How to compute the probabilities?
§ Simply counting the occurrences may lead to wrong results when probabilities are small
• M-estimate approach adapted for text:
§ Nc is the whole number of word positions in documents of class C
§ Nc,k is the number of occurrences of wk in documents of class C
§ |Vocabulary| is the number of distinct words in the training set
§ Uniform priors are assumed
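The estimate itself is missing from the extracted text; given the quantities Nc, Nc,k, and |Vocabulary| defined above, the standard m-estimate with uniform priors (Laplace smoothing) is presumably

P(w_k \mid C) = \frac{N_{C,k} + 1}{N_C + |\mathrm{Vocabulary}|}

which never assigns probability zero to a word that does not occur in the training documents of class C.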
Naïve Bayes for Text Classification
• Final classification picks the class C* that maximizes the posterior probability, as in the model above
• Despite its simplicity, the Naïve Bayes classifier works very well in practice
• Applications
§ Newsgroup post classification
§ NewsWeeder (news recommender)
Clustering Text
Text Clustering
• It involves the same steps as other text mining algorithms to preprocess the text data into an adequate representation (see the sketch below)
§ Tokenizing and stemming each synopsis
§ Transforming the corpus into a vector space using TF-IDF
§ Calculating the cosine distance between documents as a measure of similarity
§ Clustering the documents using the similarity measure
• Everything works as with the usual data, except that text requires a rather long preprocessing step
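A minimal sketch of this pipeline with scikit-learn; the toy documents and the number of clusters are illustrative assumptions, and stemming is omitted for brevity:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "a thrilling space adventure among the stars",
    "an astronaut stranded on a distant planet",
    "a romantic comedy set in Paris",
    "two strangers fall in love in a small cafe",
]

# TF-IDF vectors are L2-normalized by default, so Euclidean distance between them
# is a monotone function of cosine distance; KMeans therefore groups by cosine similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)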
Weka DEMO
• Movie Review Data
§ http://www.cs.cornell.edu/people/pabo/movie-review-data/
• Load Text Data in Weka (CLI)
Weka DEMO (2)
• From text to word vectors: StringToWordVector
• Naïve Bayes Classifier
• Stoplist
§ https://code.google.com/p/stop-words/
• Attribute Selection