Data Mining Lectures Lecture 14: Text Mining Padhraic Smyth, UC Irvine
ICS 278: Data Mining
Lecture 14: Text Mining and Information Retrieval
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Lecture Topics in Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Mining Applications
• Information Retrieval
  – Query-based search of large text archives, e.g., the Web
• Text Classification
  – Automated assignment of topics to Web pages, e.g., Yahoo, Google
  – Automated classification of email into spam and non-spam
• Text Clustering
  – Automated organization of search results in real time into categories
  – Discovery of clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  – Extracting standard fields from free text
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting information automatically from resumes
    • extracting protein interaction information from biology papers
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
General concepts in Information Retrieval
• Representation language
  – typically a vector of d attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing images
    • word counts for text documents
• Data set D of N objects
  – typically represented as an N x d matrix
• Query Q
  – the user poses a query to search D
  – the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • Q is also expressed as a set of words, e.g., “data” and “mining”
Query by Content
• Traditional DB query: exact matches
  – e.g., query Q = [level = MANAGER] AND [age < 30]
  – or a Boolean match on text:
    • query = “Irvine” AND “fun”: return all docs containing both “Irvine” and “fun”
  – not useful when there are many matches
    • e.g., “data mining” in Google returns 60 million documents
• Query by content: more general / less precise
  – e.g., which record is most similar to a query Q?
  – for text data, often called information retrieval (IR)
  – can also be used for images, sequences, video, etc.
  – Q can itself be an object (e.g., a document) or a shorter version (e.g., a single word)
• Goal
  – match query Q to the N objects in the database
  – return a (typically ranked) list of the most similar/relevant objects in the data set D given Q
Issues in Query by Content
– What representation language to use
– How to measure similarity between Q and each object in D
– How to compute the results in real time (for interactive querying)
– How to rank the results for the user
– How to allow user feedback (query modification)
– How to evaluate and compare different IR algorithms/systems
The Standard Approach
• Fixed-length (d-dimensional) vector representation
  – for both the query (1 x d vector Q) and the database objects (n x d matrix X)
• Use domain-specific higher-level features (vs. raw data)
  – images: “bag of features”: color (e.g., RGB), texture (e.g., Gabor or Fourier coefficients), …
  – text: “bag of words”: frequency count for each word in each document, …
    • also known as the “vector-space” model
• Compute distances between the vectorized representations
• Use k-NN to find the k vectors in X closest to Q
Text Retrieval
• Document: book, paper, Web page, …
• Term: word, word pair, phrase, … (often 50,000+ terms)
• Query Q = set of terms, e.g., “data” + “mining”
• Full NLP (natural language processing) is too hard, so …
• We want a (vector) representation for text that
  – retains maximum useful semantics
  – supports efficient distance computation between documents and Q
• Term weights
  – Boolean (e.g., is the term in the document or not): “bag of words”
  – real-valued (e.g., frequency of the term in the document, or relative to all documents), …
• Note: this loses word order, sentence structure, etc.
Practical Issues

• Tokenization
  – convert each document to word counts
  – word token = “any nonempty sequence of characters”
  – for HTML (etc.) need to remove formatting
• Canonical forms, stopwords, stemming
  – remove capitalization
  – stopwords: remove very frequent words (a, the, and, …) using a standard list; very rare words can also be removed
  – stemming (next slide)
• Data representation
  – e.g., 3-column form: <docid, termid, position>
  – inverted index (faster)
    • a sorted list of <termid, docid> pairs: useful for finding the documents containing particular terms
    • equivalent to a sparse representation of the term x document matrix
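As an illustration of the inverted-index idea above, here is a toy Python sketch (not from the lecture; the documents and helper names are made up):

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-letters; a real tokenizer would also
    # strip HTML, handle punctuation, remove stopwords, etc.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # Map each term to the sorted list of doc ids that contain it.
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in tokenize(text):
            index[term].add(docid)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "data mining is fun", 2: "text mining and retrieval", 3: "data structures"}
index = build_inverted_index(docs)
# Boolean query "data" AND "mining": intersect the two posting lists.
matches = set(index["data"]) & set(index["mining"])
```

Intersecting sorted posting lists like this is exactly why the inverted index is faster than scanning every document for a Boolean query.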
Stemming
• We want to reduce all morphological variants of a word to a single index term
  – e.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (since fishing does not explicitly occur in the document)
• Stemming: reduce words to their root form
  – e.g., fish becomes the single index term
• Porter stemming algorithm (1980)
  – relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      – BINARIZATION => BINARIZE
  – not always desirable: e.g., {university, universal} -> univers (in Porter’s algorithm)
• WordNet: a dictionary-based approach
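The suffix-rule idea can be sketched as a single hand-written rule in Python (a real Porter stemmer applies dozens of ordered rules; this one-rule version is only illustrative):

```python
import re

def stem_ization(word):
    # One Porter-style rule: if the word ends in "ization" and the
    # remaining stem contains a vowel followed by a consonant,
    # rewrite the suffix as "ize".
    if word.lower().endswith("ization"):
        stem = word[:-len("ization")]
        if re.search(r"[aeiou][^aeiou]", stem.lower()):
            return stem + "ize"
    return word

stem_ization("binarization")  # -> "binarize"
```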
Toy example of a document-term matrix
        database  SQL  index  regression  likelihood  linear
d1         24      21     9        0           0         3
d2         32      10     5        0           3         0
d3         12      16     5        0           0         0
d4          6       7     2        0           0         0
d5         43      31    20        0           3         0
d6          2       0     0       18           7        16
d7          0       0     1       32          12         0
d8          3       0     0       22           4         2
d9          1       0     0       34          27        25
d10         6       0     0       17           4        23
Document Similarity
• Measuring similarity between two documents x and y
  – wide variety of distance metrics:
    • Euclidean (L2) = sqrt( Σ_i (x_i - y_i)^2 )
    • L1 = Σ_i |x_i - y_i|
    • weighted L2 = sqrt( Σ_i (w_i x_i - w_i y_i)^2 )
    • …
• Cosine similarity between documents:

      cos(x, y) = x^T y / ( ||x|| ||y|| )

  – often gives better results than Euclidean distance
    • normalizes relative to document length
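A minimal Python sketch of the cosine computation, applied to row d1 of the toy document-term matrix and the query vector (1, 0, 1, 0, 0, 0) (illustrative code, not from the lecture):

```python
import math

def cosine(x, y):
    # cos(x, y) = x^T y / (||x|| ||y||): 1.0 for identical directions,
    # 0.0 for orthogonal vectors; insensitive to document length.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [24, 21, 9, 0, 0, 3]   # row d1 of the toy matrix
q = [1, 0, 1, 0, 0, 0]      # query containing terms t1 and t3
score = cosine(d1, q)       # ≈ 0.70
```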
Distance matrices for the toy document-term data

TF doc-term matrix:

        t1   t2   t3   t4   t5   t6
d1      24   21    9    0    0    3
d2      32   10    5    0    3    0
d3      12   16    5    0    0    0
d4       6    7    2    0    0    0
d5      43   31   20    0    3    0
d6       2    0    0   18    7   16
d7       0    0    1   32   12    0
d8       3    0    0   22    4    2
d9       1    0    0   34   27   25
d10      6    0    0   17    4   23

[Figures: the resulting Euclidean and cosine distance matrices between the 10 documents]
TF-IDF Term Weighting Schemes
• Not all terms in a query or document are equally important …
• TF (term frequency): term weight = number of occurrences in that document
  – problem: a term common to many documents has low discriminative power
• IDF (inverse document frequency of a term)
  – n_j documents contain term j, out of N documents in total
  – IDF = log(N / n_j)
  – favors terms that occur in relatively few documents
• TF-IDF weight: TF(term) * IDF(term)
• No real theoretical basis, but works well empirically and is widely used
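These TF-IDF weights can be computed from the toy matrix with a short Python sketch (illustrative code, not from the lecture):

```python
import math

def tfidf_weights(tf_matrix):
    # weight(d, j) = TF(d, j) * log(N / n_j), where n_j is the number
    # of documents in which term j occurs at least once.
    n_docs = len(tf_matrix)
    n_terms = len(tf_matrix[0])
    idf = [math.log(n_docs / sum(1 for row in tf_matrix if row[j] > 0))
           for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in tf_matrix]

# Toy doc-term matrix from the slides (rows d1..d10, columns t1..t6).
tf = [
    [24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2], [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]
w = tfidf_weights(tf)
# w[0][0] = 24 * log(10/9) ≈ 2.5, as on the next slide
```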
TF-IDF Example
TF doc-term matrix:

        t1   t2   t3   t4   t5   t6
d1      24   21    9    0    0    3
d2      32   10    5    0    3    0
d3      12   16    5    0    0    0
d4       6    7    2    0    0    0
d5      43   31   20    0    3    0
d6       2    0    0   18    7   16
d7       0    0    1   32   12    0
d8       3    0    0   22    4    2
d9       1    0    0   34   27   25
d10      6    0    0   17    4   23

TF-IDF doc-term matrix:

        t1    t2    t3    t4   t5   t6
d1     2.5  14.6   4.6    0    0   2.1
d2     3.4   6.9   2.6    0   1.1   0
d3     1.3  11.1   2.6    0    0    0
d4     0.6   4.9   1.0    0    0    0
d5     4.5  21.5  10.2    0   1.1   0
...

TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) ≈ 2.5

The IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
Baseline Document Querying System
• Queries Q = binary term vectors
• Documents represented by TF-IDF weights
• Cosine distance used for retrieval and ranking
Baseline Document Querying System
Query Q = (1, 0, 1, 0, 0, 0), i.e., the terms t1 and t3

TF doc-term matrix:

        t1   t2   t3   t4   t5   t6
d1      24   21    9    0    0    3
d2      32   10    5    0    3    0
d3      12   16    5    0    0    0
d4       6    7    2    0    0    0
d5      43   31   20    0    3    0
d6       2    0    0   18    7   16
d7       0    0    1   32   12    0
d8       3    0    0   22    4    2
d9       1    0    0   34   27   25
d10      6    0    0   17    4   23

TF-IDF doc-term matrix:

        t1    t2    t3    t4   t5   t6
d1     2.5  14.6   4.6    0    0   2.1
d2     3.4   6.9   2.6    0   1.1   0
d3     1.3  11.1   2.6    0    0    0
d4     0.6   4.9   1.0    0    0    0
d5     4.5  21.5  10.2    0   1.1   0
...

Cosine scores of Q against each document:

        TF    TF-IDF
d1     0.70   0.32
d2     0.77   0.51
d3     0.58   0.24
d4     0.60   0.23
d5     0.79   0.43
...
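The whole baseline system fits in a few lines of Python: TF-IDF weighting plus cosine ranking on the toy data (an illustrative sketch, not the lecture's own code):

```python
import math

# Toy doc-term matrix from the slides (rows d1..d10, columns t1..t6).
tf = [
    [24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2], [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

N, T = len(tf), len(tf[0])
idf = [math.log(N / sum(1 for r in tf if r[j] > 0)) for j in range(T)]
w = [[r[j] * idf[j] for j in range(T)] for r in tf]   # TF-IDF weights

q = [1, 0, 1, 0, 0, 0]                                # query: t1 and t3
scores_tf = [cosine(r, q) for r in tf]
scores_tfidf = [cosine(r, q) for r in w]
# e.g., d1 scores about 0.70 (TF) and 0.32 (TF-IDF), matching the slide
```

Sorting the documents by either score list gives the ranked retrieval result; note that d5 ranks first under TF weighting.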
Synonymy and Polysemy
• Synonymy
  – the same concept can be expressed using different sets of terms
    • e.g., bandit, brigand, thief
  – negatively affects recall
• Polysemy
  – identical terms can be used in very different semantic contexts
    • e.g., bank:
      – a repository where important material is saved
      – the slope beside a body of water
  – negatively affects precision
Latent Semantic Indexing

• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
• Find the k linear projections of the data that contain the most variance
  – principal components analysis, or the SVD
  – known as “latent semantic indexing” (LSI) when applied to text
• Captures dependencies among terms
  – in effect, replaces the original d-dimensional basis with a k-dimensional basis
  – e.g., terms like SQL, indexing, and query could be approximated as coming from a single “hidden” term
• Why is this useful?
  – the query contains “automobile”, the document contains “vehicle”
  – Q can still be matched to the document, since the two terms will be close in the k-dimensional space (though not in the original space), i.e., LSI addresses the synonymy problem
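A sketch of LSI on the toy document-term matrix using NumPy's SVD (illustrative code, not from the lecture; the query fold-in formula q_k = q V_k / s_k is a common LSI convention rather than something stated on the slides, and the SVD's sign conventions may flip each axis):

```python
import numpy as np

# Toy doc-term matrix from the slides (rows d1..d10, columns t1..t6).
M = np.array([
    [24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2], [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]     # document coordinates in the k-space
M_k = docs_k @ Vt[:k, :]      # rank-k approximation of M

# Fold a query into the same k-space before cosine matching.
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)
q_k = (q @ Vt[:k, :].T) / s[:k]
```

Up to sign, the first column of `docs_k` reproduces the U1 values shown on the "Example of SVD" slide (about 30.9 for d1).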
Toy example of a document-term matrix
        database  SQL  index  regression  likelihood  linear
d1         24      21     9        0           0         3
d2         32      10     5        0           3         0
d3         12      16     5        0           0         0
d4          6       7     2        0           0         0
d5         43      31    20        0           3         0
d6          2       0     0       18           7        16
d7          0       0     1       32          12         0
d8          3       0     0       22           4         2
d9          1       0     0       34          27        25
d10         6       0     0       17           4        23
SVD
• M = U S V^T
  – M = n x d: the original document-term matrix (the data)
  – U = n x d: each row is a vector of weights for one document
  – S = d x d: diagonal matrix of singular values
  – rows of V^T (columns of V) = a new orthogonal basis for the data
  – each singular value indicates how much of the data’s variance is captured by the corresponding basis vector
  – typically we keep just the first k basis vectors, k << d
Example of SVD

The first two columns of U S (i.e., the coordinates of each document along the first two basis vectors), alongside the original data:

        U1      U2
d1     30.9   -11.5
d2     30.3   -10.8
d3     18.0    -7.7
d4      8.4    -3.6
d5     52.7   -20.6
d6     14.2    21.8
d7     10.8    21.9
d8     11.5    28.0
d9      9.5    17.8
d10    19.9    45.0

        database  SQL  index  regression  likelihood  linear
d1         24      21     9        0           0         3
d2         32      10     5        0           3         0
d3         12      16     5        0           0         0
d4          6       7     2        0           0         0
d5         43      31    20        0           3         0
d6          2       0     0       18           7        16
d7          0       0     1       32          12         0
d8          3       0     0       22           4         2
d9          1       0     0       34          27        25
d10         6       0     0       17           4        23
v1 = [ 0.74,  0.49,  0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]

New test documents: D1 = “database” x 50, D2 = “SQL” x 50
Probabilistic Approaches to Retrieval
• Compute P(q | d) for each document d
  – intuition: the relevance of d to q is related to how likely it is that q was generated by d, i.e., “how likely is q under a model for d?”
• Simple model for P(q | d)
  – Pe(q | d) = empirical frequency of the query words in document d
  – “tuned” to d, but likely to be sparse (will contain many zeros)
• Two-stage probabilistic model (or linear interpolation model)
  – P(q | d) = λ Pe(q | d) + (1 - λ) Pe(q | corpus)
  – λ can be fixed, e.g., tuned to a particular data set
  – or it can depend on d, e.g., λ = n_d / (n_d + m), where n_d = number of words in document d and m is a constant (e.g., 1000)
• Can also use more sophisticated models for P(q | d), e.g., topic-based models
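The interpolation model can be sketched as follows (illustrative Python; multiplying per-word probabilities assumes word independence, a standard choice not spelled out on the slide, and the example documents are made up):

```python
from collections import Counter

def lm_score(query, doc, corpus, m=1000.0):
    # Linear-interpolation ("two-stage") language model:
    # P(q|d) = prod_w [ lam * Pe(w|d) + (1 - lam) * Pe(w|corpus) ]
    # with lam = n_d / (n_d + m), as on the slide.
    doc_counts, corpus_counts = Counter(doc), Counter(corpus)
    n_d, n_c = len(doc), len(corpus)
    lam = n_d / (n_d + m)
    p = 1.0
    for word in query:
        p_d = doc_counts[word] / n_d          # empirical, sparse
        p_c = corpus_counts[word] / n_c       # corpus background
        p *= lam * p_d + (1 - lam) * p_c
    return p

doc1 = ["data", "mining", "data"]
doc2 = ["text", "retrieval"]
corpus = doc1 + doc2
s1 = lm_score(["data"], doc1, corpus, m=10)
s2 = lm_score(["data"], doc2, corpus, m=10)
```

The corpus term keeps `s2` above zero even though doc2 never contains "data"; this smoothing is the point of the interpolation.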
Evaluating Retrieval Methods
• For predictive models (classification/regression) the objective is clear
  – score = accuracy on unseen test data
• Evaluation is more complex for query by content
  – the real score = how “useful” the retrieved information is (subjective)
    • e.g., how would you define the real score for Google’s top 10 hits?
• Toward objectivity, assume:
  – 1) each object is “relevant” or “irrelevant”
    • simplification: binary, and the same for all users (e.g., committee vote)
  – 2) each object is labeled by an objective/consistent oracle
  – these assumptions suggest a classifier approach is possible
    • but the goals are rather different: we want the objects nearest to Q, not separability per se
    • and it would require learning a classifier at query time (Q = positive class)
      – which is why a k-NN-type approach seems so appropriate …
Precision versus Recall
• Rank the documents (numerically) with respect to the query
• Compute precision and recall by thresholding the ranking
  – precision: the fraction of retrieved objects that are relevant
  – recall: the fraction of all relevant objects that are retrieved
• Tradeoff: high precision -> low recall, and vice versa
• Very similar in concept to an ROC curve
• For multiple queries, precision at specific ranges of recall can be averaged (so-called “interpolated precision”)
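Thresholding a ranking at depth k gives one precision/recall point, as in this minimal sketch (the ranked list and relevance judgments are made up for illustration):

```python
def precision_recall_at_k(ranked_docids, relevant, k):
    # Threshold the ranking at depth k:
    # precision = |retrieved ∩ relevant| / |retrieved|
    # recall    = |retrieved ∩ relevant| / |relevant|
    retrieved = set(ranked_docids[:k])
    hits = len(retrieved & relevant)
    return hits / k, hits / len(relevant)

ranked = [3, 1, 7, 2, 9]       # hypothetical ranked doc ids
relevant = {1, 2, 5}           # hypothetical relevance judgments
p3, r3 = precision_recall_at_k(ranked, relevant, 3)
p5, r5 = precision_recall_at_k(ranked, relevant, 5)
```

Sweeping k from 1 to the length of the ranking traces out the full precision-recall curve.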
Precision-Recall Curve (a form of ROC)

[Figure: precision-recall curves for three systems A, B, and C; C is universally worse than A & B]

Alternative (single-point) summary values:
  – precision where recall = precision
  – precision for a fixed number of retrievals
  – average precision over multiple recall levels
TREC Evaluations

• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – the TREC organizers supply contestants with several hundred queries Q
  – each competing system provides its ranked list of documents
  – the union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query Q
  – precision, recall, etc., are then calculated and the systems compared
Other Examples of Evaluation Data Sets
• Cranfield data
  – number of documents = 1400
  – 225 queries: “medium length”, manually constructed “test questions”
  – relevance determined by an expert committee (from 1968)
• Newsgroups
  – articles from 20 Usenet newsgroups
  – queries = randomly selected documents
  – relevance: is document d in the same category as the query document?
Performance on Cranfield Document Set
Performance on Newsgroups Data
Related Types of Data
• Sparse high-dimensional count data sets, like document-term matrices, are common in data mining, e.g.,
  – “transaction data”
    • rows = customers
    • columns = products
  – Web log data (ignoring sequence)
    • rows = Web surfers
    • columns = Web pages
• Recommender systems
  – given some products purchased by user i, suggest other products to the user
    • e.g., Amazon.com’s book recommender
  – collaborative filtering:
    • use the k nearest individuals as the basis for predictions
  – many similarities with querying and information retrieval
    • e.g., use of cosine distance to normalize vectors
Web-based Retrieval
• Additional information in Web documents
  – link structure (e.g., PageRank: to be discussed later)
  – HTML structure
    • link/anchor text
    • title text
    • etc.
  – all of which can be leveraged for better retrieval
• Additional issues in Web retrieval
  – scalability: the size of the “corpus” is huge (10 to 100 billion docs)
  – constantly changing:
    • crawlers must update document-term information
    • need schemes for efficiently updating the indices
  – evaluation is more difficult: how is relevance measured? how many documents in total are relevant?
Further Reading
• Text: Chapter 14
• General reference on text and language modeling
  – Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
• Very useful reference on indexing and searching text
  – Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, I. H. Witten, A. Moffat, and T. C. Bell, Morgan Kaufmann, 1999.
• Web-related document search
  – An excellent resource on Web-related search is Chapter 3, “Web Search and Information Retrieval,” in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
  – Information on how real Web search engines work: http://searchenginewatch.com/
• Latent semantic analysis
  – Applied to grading of essays: “The debate on automated grading,” IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf