Data Mining Lectures Lecture 14: Text Mining Padhraic Smyth, UC Irvine
ICS 278: Data Mining
Lecture 14: Text Mining and Information Retrieval
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Lecture Topics in Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Mining Applications
• Information Retrieval
  – Query-based search of large text archives, e.g., the Web
• Text Classification
  – Automated assignment of topics to Web pages, e.g., Yahoo, Google
  – Automated classification of email into spam and non-spam
• Text Clustering
  – Automated organization of search results in real time into categories
  – Discovery of clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  – Extracting standard fields from free text
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting information automatically from resumes
    • extracting protein interaction information from biology papers
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
General concepts in Information Retrieval
• Representation language
  – typically a vector of d attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing images
    • word counts for text documents
• Data set D of N objects
  – typically represented as an N x d matrix
• Query Q
  – the user poses a query to search D
  – the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • Q is also expressed as a set of words, e.g., “data” and “mining”
Query by Content
• Traditional DB query: exact matches
  – e.g., query Q = [level = MANAGER] AND [age < 30]
  – or a Boolean match on text:
    • query = “Irvine” AND “fun”: return all docs containing both “Irvine” and “fun”
  – not useful when there are many matches
    • e.g., “data mining” in Google returns 60 million documents
• Query by content: more general / less precise
  – e.g., which record is most similar to a query Q?
  – for text data, often called information retrieval (IR)
  – can also be used for images, sequences, video, etc.
  – Q can itself be an object (e.g., a document) or a shorter version (e.g., a single word)
• Goal
  – match query Q to the N objects in the database
  – return a (typically ranked) list of the most similar/relevant objects in the data set D given Q
Issues in Query by Content
– What representation language to use
– How to measure similarity between Q and each object in D
– How to compute the results in real time (for interactive querying)
– How to rank the results for the user
– How to allow user feedback (query modification)
– How to evaluate and compare different IR algorithms/systems
The Standard Approach
• Fixed-length (d-dimensional) vector representation
  – for both the query (1 x d vector Q) and the database objects (n x d matrix X)
• Use domain-specific higher-level features (vs. raw data)
  – images: “bag of features”: color (e.g., RGB), texture (e.g., Gabor or Fourier coefficients), …
  – text: “bag of words”: frequency count for each word in each document, …
    • also known as the “vector-space” model
• Compute distances between the vectorized representations
• Use k-NN to find the k vectors in X closest to Q
Text Retrieval
• Document: book, paper, Web page, …
• Term: word, word pair, phrase, … (often 50,000+ terms)
• Query Q = set of terms, e.g., “data” + “mining”
• Full NLP (natural language processing) is too hard, so …
• We want a (vector) representation for text that
  – retains maximum useful semantics
  – supports efficient distance computation between documents and Q
• Term weights
  – Boolean (e.g., is the term in the document or not): “bag of words”
  – real-valued (e.g., frequency of the term in the document, or relative to all documents), …
• Note: this loses word order, sentence structure, etc.
Practical Issues

• Tokenization
  – convert each document to word counts
  – word token = “any nonempty sequence of characters”
  – for HTML (etc.) need to remove formatting
• Canonical forms, stopwords, stemming
  – remove capitalization
  – stopwords: remove very frequent words (a, the, and, …) using a standard list; very rare words can also be removed
  – stemming (next slide)
• Data representation
  – e.g., 3-column form: <docid, termid, position>
  – inverted index (faster)
    • a sorted list of <termid, docid> pairs: useful for finding the documents containing particular terms
    • equivalent to a sparse representation of the term x document matrix
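As an illustration of the inverted-index idea above, here is a toy Python sketch (not from the lecture; the documents and helper names are made up):

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-letters; a real tokenizer would also
    # strip HTML, handle punctuation, remove stopwords, etc.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # Map each term to the sorted list of doc ids that contain it.
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in tokenize(text):
            index[term].add(docid)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "data mining is fun", 2: "text mining and retrieval", 3: "data structures"}
index = build_inverted_index(docs)
# Boolean query "data" AND "mining": intersect the two posting lists.
matches = set(index["data"]) & set(index["mining"])
```

Intersecting sorted posting lists like this is exactly why the inverted index is faster than scanning every document for a Boolean query.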
Stemming
• We want to reduce all morphological variants of a word to a single index term
  – e.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (since fishing does not explicitly occur in the document)
• Stemming: reduce words to their root form
  – e.g., fish becomes the single index term
• Porter stemming algorithm (1980)
  – relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      – BINARIZATION => BINARIZE
  – not always desirable: e.g., {university, universal} -> univers (in Porter’s algorithm)
• WordNet: a dictionary-based approach
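The suffix-rule idea can be sketched as a single hand-written rule in Python (a real Porter stemmer applies dozens of ordered rules; this one-rule version is only illustrative):

```python
import re

def stem_ization(word):
    # One Porter-style rule: if the word ends in "ization" and the
    # remaining stem contains a vowel followed by a consonant,
    # rewrite the suffix as "ize".
    if word.lower().endswith("ization"):
        stem = word[:-len("ization")]
        if re.search(r"[aeiou][^aeiou]", stem.lower()):
            return stem + "ize"
    return word

stem_ization("binarization")  # -> "binarize"
```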
Toy example of a document-term matrix
        database  SQL  index  regression  likelihood  linear
d1         24      21     9        0           0         3
d2         32      10     5        0           3         0
d3         12      16     5        0           0         0
d4          6       7     2        0           0         0
d5         43      31    20        0           3         0
d6          2       0     0       18           7        16
d7          0       0     1       32          12         0
d8          3       0     0       22           4         2
d9          1       0     0       34          27        25
d10         6       0     0       17           4        23
Document Similarity
• Measuring similarity between two documents x and y
  – wide variety of distance metrics:
    • Euclidean (L2) = sqrt( Σ_i (x_i - y_i)^2 )
    • L1 = Σ_i |x_i - y_i|
    • weighted L2 = sqrt( Σ_i (w_i x_i - w_i y_i)^2 )
    • …
• Cosine similarity between documents:

      cos(x, y) = x^T y / ( ||x|| ||y|| )

  – often gives better results than Euclidean distance
    • normalizes relative to document length
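A minimal Python sketch of the cosine computation, applied to row d1 of the toy document-term matrix and the query vector (1, 0, 1, 0, 0, 0) (illustrative code, not from the lecture):

```python
import math

def cosine(x, y):
    # cos(x, y) = x^T y / (||x|| ||y||): 1.0 for identical directions,
    # 0.0 for orthogonal vectors; insensitive to document length.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [24, 21, 9, 0, 0, 3]   # row d1 of the toy matrix
q = [1, 0, 1, 0, 0, 0]      # query containing terms t1 and t3
score = cosine(d1, q)       # ≈ 0.70
```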
Distance matrices for the toy document-term data

TF doc-term matrix:

        t1   t2   t3   t4   t5   t6
d1      24   21    9    0    0    3
d2      32   10    5    0    3    0
d3      12   16    5    0    0    0
d4       6    7    2    0    0    0
d5      43   31   20    0    3    0
d6       2    0    0   18    7   16
d7       0    0    1   32   12    0
d8       3    0    0   22    4    2
d9       1    0    0   34   27   25
d10      6    0    0   17    4   23

[Figures: the resulting Euclidean and cosine distance matrices between the 10 documents]
TF-IDF Term Weighting Schemes
• Not all terms in a query or document are equally important …
• TF (term frequency): term weight = number of occurrences in that document
  – problem: a term common to many documents has low discriminative power
• IDF (inverse document frequency of a term)
  – n_j documents contain term j, out of N documents in total
  – IDF = log(N / n_j)
  – favors terms that occur in relatively few documents
• TF-IDF weight: TF(term) * IDF(term)
• No real theoretical basis, but works well empirically and is widely used
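These TF-IDF weights can be computed from the toy matrix with a short Python sketch (illustrative code, not from the lecture):

```python
import math

def tfidf_weights(tf_matrix):
    # weight(d, j) = TF(d, j) * log(N / n_j), where n_j is the number
    # of documents in which term j occurs at least once.
    n_docs = len(tf_matrix)
    n_terms = len(tf_matrix[0])
    idf = [math.log(n_docs / sum(1 for row in tf_matrix if row[j] > 0))
           for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in tf_matrix]

# Toy doc-term matrix from the slides (rows d1..d10, columns t1..t6).
tf = [
    [24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2], [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]
w = tfidf_weights(tf)
# w[0][0] = 24 * log(10/9) ≈ 2.5, as on the next slide
```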
TF-IDF Example
TF doc-term matrix:

        t1   t2   t3   t4   t5   t6
d1      24   21    9    0    0    3
d2      32   10    5    0    3    0
d3      12   16    5    0    0    0
d4       6    7    2    0    0    0
d5      43   31   20    0    3    0
d6       2    0    0   18    7   16
d7       0    0    1   32   12    0
d8       3    0    0   22    4    2
d9       1    0    0   34   27   25
d10      6    0    0   17    4   23

TF-IDF doc-term matrix:

        t1    t2    t3    t4   t5   t6
d1     2.5  14.6   4.6    0    0   2.1
d2     3.4   6.9   2.6    0   1.1   0
d3     1.3  11.1   2.6    0    0    0
d4     0.6   4.9   1.0    0    0    0
d5     4.5  21.5  10.2    0   1.1   0
...

TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) ≈ 2.5

The IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
Baseline Document Querying System
• Queries Q = binary term vectors
• Documents represented by TF-IDF weights
• Cosine distance used for retrieval and ranking
Baseline Document Querying System
Query Q = (1, 0, 1, 0, 0, 0), i.e., the terms t1 and t3

TF doc-term matrix:

        t1   t2   t3   t4   t5   t6
d1      24   21    9    0    0    3
d2      32   10    5    0    3    0
d3      12   16    5    0    0    0
d4       6    7    2    0    0    0
d5      43   31   20    0    3    0
d6       2    0    0   18    7   16
d7       0    0    1   32   12    0
d8       3    0    0   22    4    2
d9       1    0    0   34   27   25
d10      6    0    0   17    4   23

TF-IDF doc-term matrix:

        t1    t2    t3    t4   t5   t6
d1     2.5  14.6   4.6    0    0   2.1
d2     3.4   6.9   2.6    0   1.1   0
d3     1.3  11.1   2.6    0    0    0
d4     0.6   4.9   1.0    0    0    0
d5     4.5  21.5  10.2    0   1.1   0
...

Cosine scores of Q against each document:

        TF    TF-IDF
d1     0.70   0.32
d2     0.77   0.51
d3     0.58   0.24
d4     0.60   0.23
d5     0.79   0.43
...
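The whole baseline system fits in a few lines of Python: TF-IDF weighting plus cosine ranking on the toy data (an illustrative sketch, not the lecture's own code):

```python
import math

# Toy doc-term matrix from the slides (rows d1..d10, columns t1..t6).
tf = [
    [24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2], [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

N, T = len(tf), len(tf[0])
idf = [math.log(N / sum(1 for r in tf if r[j] > 0)) for j in range(T)]
w = [[r[j] * idf[j] for j in range(T)] for r in tf]   # TF-IDF weights

q = [1, 0, 1, 0, 0, 0]                                # query: t1 and t3
scores_tf = [cosine(r, q) for r in tf]
scores_tfidf = [cosine(r, q) for r in w]
# e.g., d1 scores about 0.70 (TF) and 0.32 (TF-IDF), matching the slide
```

Sorting the documents by either score list gives the ranked retrieval result; note that d5 ranks first under TF weighting.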
Synonymy and Polysemy
• Synonymy
  – the same concept can be expressed using different sets of terms
    • e.g., bandit, brigand, thief
  – negatively affects recall
• Polysemy
  – identical terms can be used in very different semantic contexts
    • e.g., bank:
      – a repository where important material is saved
      – the slope beside a body of water
  – negatively affects precision
Latent Semantic Indexing

• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
• Find the k linear projections of the data that contain the most variance
  – principal components analysis, or the SVD
  – known as “latent semantic indexing” (LSI) when applied to text
• Captures dependencies among terms
  – in effect, replaces the original d-dimensional basis with a k-dimensional basis
  – e.g., terms like SQL, indexing, and query could be approximated as coming from a single “hidden” term
• Why is this useful?
  – the query contains “automobile”, the document contains “vehicle”
  – Q can still be matched to the document, since the two terms will be close in the k-dimensional space (though not in the original space), i.e., LSI addresses the synonymy problem
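A sketch of LSI on the toy document-term matrix using NumPy's SVD (illustrative code, not from the lecture; the query fold-in formula q_k = q V_k / s_k is a common LSI convention rather than something stated on the slides, and the SVD's sign conventions may flip each axis):

```python
import numpy as np

# Toy doc-term matrix from the slides (rows d1..d10, columns t1..t6).
M = np.array([
    [24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2], [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]     # document coordinates in the k-space
M_k = docs_k @ Vt[:k, :]      # rank-k approximation of M

# Fold a query into the same k-space before cosine matching.
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)
q_k = (q @ Vt[:k, :].T) / s[:k]
```

Up to sign, the first column of `docs_k` reproduces the U1 values shown on the "Example of SVD" slide (about 30.9 for d1).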
Toy example of a document-term matrix
        database  SQL  index  regression  likelihood  linear
d1         24      21     9        0           0         3
d2         32      10     5        0           3         0
d3         12      16     5        0           0         0
d4          6       7     2        0           0         0
d5         43      31    20        0           3         0
d6          2       0     0       18           7        16
d7          0       0     1       32          12         0
d8          3       0     0       22           4         2
d9          1       0     0       34          27        25
d10         6       0     0       17           4        23
SVD
• M = U S V^T
  – M = n x d: the original document-term matrix (the data)
  – U = n x d: each row is a vector of weights for one document
  – S = d x d: diagonal matrix of singular values
  – rows of V^T (columns of V) = a new orthogonal basis for the data
  – each singular value indicates how much of the data’s variance is captured by the corresponding basis vector
  – typically we keep just the first k basis vectors, k << d
Example of SVD

The first two columns of U S (i.e., the coordinates of each document along the first two basis vectors), alongside the original data:

        U1      U2
d1     30.9   -11.5
d2     30.3   -10.8
d3     18.0    -7.7
d4      8.4    -3.6
d5     52.7   -20.6
d6     14.2    21.8
d7     10.8    21.9
d8     11.5    28.0
d9      9.5    17.8
d10    19.9    45.0

        database  SQL  index  regression  likelihood  linear
d1         24      21     9        0           0         3
d2         32      10     5        0           3         0
d3         12      16     5        0           0         0
d4          6       7     2        0           0         0
d5         43      31    20        0           3         0
d6          2       0     0       18           7        16
d7          0       0     1       32          12         0
d8          3       0     0       22           4         2
d9          1       0     0       34          27        25
d10         6       0     0       17           4        23
v1 = [ 0.74,  0.49,  0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]

New test documents: D1 = “database” x 50, D2 = “SQL” x 50
Probabilistic Approaches to Retrieval
• Compute P(q | d) for each document d
  – intuition: the relevance of d to q is related to how likely it is that q was generated by d, i.e., “how likely is q under a model for d?”
• Simple model for P(q | d)
  – Pe(q | d) = empirical frequency of the query words in document d
  – “tuned” to d, but likely to be sparse (will contain many zeros)
• Two-stage probabilistic model (or linear interpolation model)
  – P(q | d) = λ Pe(q | d) + (1 - λ) Pe(q | corpus)
  – λ can be fixed, e.g., tuned to a particular data set
  – or it can depend on d, e.g., λ = n_d / (n_d + m), where n_d = number of words in document d and m is a constant (e.g., 1000)
• Can also use more sophisticated models for P(q | d), e.g., topic-based models
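The interpolation model can be sketched as follows (illustrative Python; multiplying per-word probabilities assumes word independence, a standard choice not spelled out on the slide, and the example documents are made up):

```python
from collections import Counter

def lm_score(query, doc, corpus, m=1000.0):
    # Linear-interpolation ("two-stage") language model:
    # P(q|d) = prod_w [ lam * Pe(w|d) + (1 - lam) * Pe(w|corpus) ]
    # with lam = n_d / (n_d + m), as on the slide.
    doc_counts, corpus_counts = Counter(doc), Counter(corpus)
    n_d, n_c = len(doc), len(corpus)
    lam = n_d / (n_d + m)
    p = 1.0
    for word in query:
        p_d = doc_counts[word] / n_d          # empirical, sparse
        p_c = corpus_counts[word] / n_c       # corpus background
        p *= lam * p_d + (1 - lam) * p_c
    return p

doc1 = ["data", "mining", "data"]
doc2 = ["text", "retrieval"]
corpus = doc1 + doc2
s1 = lm_score(["data"], doc1, corpus, m=10)
s2 = lm_score(["data"], doc2, corpus, m=10)
```

The corpus term keeps `s2` above zero even though doc2 never contains "data"; this smoothing is the point of the interpolation.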
Evaluating Retrieval Methods
• For predictive models (classification/regression) the objective is clear
  – score = accuracy on unseen test data
• Evaluation is more complex for query by content
  – the real score = how “useful” the retrieved information is (subjective)
    • e.g., how would you define the real score for Google’s top 10 hits?
• Toward objectivity, assume:
  – 1) each object is “relevant” or “irrelevant”
    • simplification: binary, and the same for all users (e.g., committee vote)
  – 2) each object is labeled by an objective/consistent oracle
  – these assumptions suggest a classifier approach is possible
    • but the goals are rather different: we want the objects nearest to Q, not separability per se
    • and it would require learning a classifier at query time (Q = positive class)
      – which is why a k-NN-type approach seems so appropriate …
Precision versus Recall
• Rank the documents (numerically) with respect to the query
• Compute precision and recall by thresholding the ranking
  – precision: the fraction of retrieved objects that are relevant
  – recall: the fraction of all relevant objects that are retrieved
• Tradeoff: high precision -> low recall, and vice versa
• Very similar in concept to an ROC curve
• For multiple queries, precision at specific ranges of recall can be averaged (so-called “interpolated precision”)
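Thresholding a ranking at depth k gives one precision/recall point, as in this minimal sketch (the ranked list and relevance judgments are made up for illustration):

```python
def precision_recall_at_k(ranked_docids, relevant, k):
    # Threshold the ranking at depth k:
    # precision = |retrieved ∩ relevant| / |retrieved|
    # recall    = |retrieved ∩ relevant| / |relevant|
    retrieved = set(ranked_docids[:k])
    hits = len(retrieved & relevant)
    return hits / k, hits / len(relevant)

ranked = [3, 1, 7, 2, 9]       # hypothetical ranked doc ids
relevant = {1, 2, 5}           # hypothetical relevance judgments
p3, r3 = precision_recall_at_k(ranked, relevant, 3)
p5, r5 = precision_recall_at_k(ranked, relevant, 5)
```

Sweeping k from 1 to the length of the ranking traces out the full precision-recall curve.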
Precision-Recall Curve (a form of ROC)

[Figure: precision-recall curves for three systems A, B, and C; C is universally worse than A & B]

Alternative (single-point) summary values:
  – precision where recall = precision
  – precision for a fixed number of retrievals
  – average precision over multiple recall levels
TREC Evaluations

• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – the TREC organizers supply contestants with several hundred queries Q
  – each competing system provides its ranked list of documents
  – the union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query Q
  – precision, recall, etc., are then calculated and the systems compared
Other Examples of Evaluation Data Sets
• Cranfield data
  – number of documents = 1400
  – 225 queries: “medium length”, manually constructed “test questions”
  – relevance determined by an expert committee (from 1968)
• Newsgroups
  – articles from 20 Usenet newsgroups
  – queries = randomly selected documents
  – relevance: is document d in the same category as the query document?
Performance on Cranfield Document Set
Performance on Newsgroups Data
Related Types of Data
• Sparse high-dimensional count data sets, like document-term matrices, are common in data mining, e.g.,
  – “transaction data”
    • rows = customers
    • columns = products
  – Web log data (ignoring sequence)
    • rows = Web surfers
    • columns = Web pages
• Recommender systems
  – given some products purchased by user i, suggest other products to the user
    • e.g., Amazon.com’s book recommender
  – collaborative filtering:
    • use the k nearest individuals as the basis for predictions
  – many similarities with querying and information retrieval
    • e.g., use of cosine distance to normalize vectors
Web-based Retrieval
• Additional information in Web documents
  – link structure (e.g., PageRank: to be discussed later)
  – HTML structure
    • link/anchor text
    • title text
    • etc.
  – all of which can be leveraged for better retrieval
• Additional issues in Web retrieval
  – scalability: the size of the “corpus” is huge (10 to 100 billion docs)
  – constantly changing:
    • crawlers must update document-term information
    • need schemes for efficiently updating the indices
  – evaluation is more difficult: how is relevance measured? how many documents in total are relevant?
Further Reading
• Text: Chapter 14
• General reference on text and language modeling
  – Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
• Very useful reference on indexing and searching text
  – Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, I. H. Witten, A. Moffat, and T. C. Bell, Morgan Kaufmann, 1999.
• Web-related document search
  – An excellent resource on Web-related search is Chapter 3, “Web Search and Information Retrieval,” in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
  – Information on how real Web search engines work: http://searchenginewatch.com/
• Latent semantic analysis
  – Applied to grading of essays: “The debate on automated grading,” IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf