
2. Text Mining

Text Mining

Goals

To learn key problems and techniques in mining one of the most common types of data

To learn how to represent text numerically

To learn how to make use of enormous amounts of unlabeled data

To learn how to find co-occurring keywords in documents

2.1 Basics of Text Representation and Analysis

based on:

Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13

What is text mining?

Definition

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation

Most knowledge is stored in the form of text, both in industry and in academia.

This alone makes text mining an integral part of knowledge discovery!

Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

Why text mining?

Text data is growing in an unprecedented manner

Digital libraries

Web and Web-enabled applications (e.g. social networks)

Newswire services

Text mining terminology

Important definitions

A set of features of text is also referred to as a lexicon.

A document can be viewed either as a sequence or as a multidimensional record.

A collection of documents is referred to as a corpus.

Text mining terminology

A number of special characteristics of text data

Very sparse

Diverse length

Nonnegative statistics

Side information is often available, e.g. hyperlinks, metadata

Lots of unlabeled data

What is text mining?

Common tasks

Information retrieval: Find documents that are relevant to a user, or to a query, in a collection of documents

Document ranking: rank all documents in the collection
Document selection: classify documents into relevant and irrelevant

Information filtering: Search newly created documents for information that is relevant to a user

Document classification: Assign a document to a category that describes its content

Keyword co-occurrence: Find groups of keywords that co-occur in many documents

Evaluating text mining

Precision and Recall

Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

The precision is the percentage of retrieved documents that are relevant to the query:

precision = ∣{Relevant} ∩ {Retrieved}∣ / ∣{Retrieved}∣ (1)

The recall is the percentage of relevant documents that were retrieved by the query:

recall = ∣{Relevant} ∩ {Retrieved}∣ / ∣{Relevant}∣ (2)
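
To make the two measures concrete, here is a minimal Python sketch with hypothetical document IDs:

```python
# Precision and recall from the two document sets (toy example).
relevant = {"d1", "d2", "d3", "d5"}     # {Relevant}: hypothetical relevant documents
retrieved = {"d2", "d3", "d4", "d6"}    # {Retrieved}: hypothetical retrieved documents

hits = relevant & retrieved             # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)  # fraction of retrieved documents that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant documents that were retrieved

print(precision, recall)                # 0.5 0.5
```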

Text representation

Tokenization

Tokenization is the process of identifying keywords in a document.

Not all words in a text are relevant.

Text mining ignores stop words.

Stop words form the stop list.

Stop lists are context-dependent.
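
A minimal sketch of tokenization with a toy, context-dependent stop list:

```python
# Whitespace tokenization followed by stop-word removal (toy stop list).
stop_list = {"the", "of", "in", "a", "is", "and"}

def tokenize(text):
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_list]

print(tokenize("The mining of text is a key problem in data mining"))
# ['mining', 'text', 'key', 'problem', 'data', 'mining']
```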

Text representation

Vector space model

Given #d documents and #t terms.

Model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix

Matrix TF of size #d × #t

Entries measure the association of a term and a document

If a term t does not occur in a document d, then TF(d, t) = 0.

If a term t does occur in a document d, then TF(d, t) > 0.
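
A sketch of building this term-frequency matrix; scikit-learn's CountVectorizer (one possible choice, any tokenizer works) returns exactly such a #d × #t count matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["salt basil parsley salt",      # toy corpus of #d = 3 documents
        "atom physics nuclear",
        "parsley atom"]

vectorizer = CountVectorizer()
TF = vectorizer.fit_transform(docs)     # sparse #d x #t matrix of raw term counts

print(vectorizer.get_feature_names_out())  # the lexicon (terms)
print(TF.toarray())                        # TF(d, t) = 0 where t does not occur in d
```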

Text representation

Definitions of term frequency

If term t occurs in document d, then possible definitions include:

TF(d, t) = 1

TF(d, t) = freq(d, t), the raw frequency of t in d

TF(d, t) = freq(d, t) / Σ_{t′∈T} freq(d, t′)

TF(d, t) = 1 + log(freq(d, t)) if freq(d, t) > 0, and TF(d, t) = 0 otherwise

Text representation

Inverse document frequency

The inverse document frequency (IDF) represents the scaling factor, or importance, of a term.

A term that appears in many documents is scaled down:

IDF(t) = log( (1 + ∣d∣) / ∣dt∣ ), (3)

where ∣d∣ is the number of all documents, and ∣dt∣ is the number of documents containing term t.

Text representation

TF-IDF measure

The TF-IDF measure is the product of term frequency and inverse document frequency:

TF-IDF(d, t) = TF(d, t) · IDF(t). (4)
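
A minimal sketch combining the log-scaled term frequency from above with (3) and (4):

```python
import math

docs = [["salt", "basil", "salt"], ["atom", "physics"], ["salt", "atom"]]  # tokenized toy docs

def tf(doc, t):
    freq = doc.count(t)                            # freq(d, t)
    return 1 + math.log(freq) if freq > 0 else 0.0

def idf(t):
    n_docs = len(docs)                             # |d|: number of all documents
    n_containing = sum(t in doc for doc in docs)   # |d_t|: documents containing t
    return math.log((1 + n_docs) / n_containing)

def tf_idf(doc, t):
    return tf(doc, t) * idf(t)                     # TF-IDF(d, t) = TF(d, t) * IDF(t)

print(tf_idf(docs[0], "salt"))  # ~1.17: frequent in doc 0, but present in 2 of 3 docs
```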

Measuring similarity

Cosine measure

Let v1 and v2 be two document vectors.

The cosine similarity is defined as

sim(v1, v2) = v1⊺v2 / (∣v1∣ ∣v2∣). (5)
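
In code, for two (e.g. TF-IDF-weighted) document vectors — a minimal numpy sketch:

```python
import numpy as np

# Cosine similarity of two document vectors.
v1 = np.array([1.7, 0.0, 0.7])
v2 = np.array([1.0, 0.5, 0.0])

sim = (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(sim)  # 1.0 for parallel vectors; 0.0 for documents with no shared terms
```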

Kernels

Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:

vectorial representation: vector kernels such as the linear, polynomial, or Gaussian RBF kernel
one long string: string kernels that count common k-mers in two strings (see the sketch below)
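
A minimal sketch of such a string kernel (the k-mer spectrum kernel, in a simple assumed variant):

```python
from collections import Counter

# Spectrum kernel: inner product of the k-mer count vectors of two strings.
def spectrum_kernel(s1, s2, k=3):
    kmers1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    kmers2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(count * kmers2[kmer] for kmer, count in kmers1.items())

print(spectrum_kernel("text mining", "data mining"))  # 5 shared 3-mers (from " mining")
```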

2.2 Topic Modeling

based on:

Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Sections 2.4.4.3 and 13.4

Topic Modeling

Definition

Topic modeling can be viewed as a probabilistic version of latent semantic analysis (LSA).

Its most basic version is referred to as Probabilistic Latent Semantic Analysis (PLSA).

It provides an alternative method for performing dimensionality reduction and has several advantages over LSA.

Topic Modeling: SVD on text

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an application of SVD to the text domain.

The goal is to retrieve a vectorial representation of terms and documents.

The data matrix D is an n × d document-term matrix containing the word frequencies in the n documents, where d is the size of the lexicon.
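
A minimal numpy sketch of LSA as a rank-k truncated SVD of D (toy data; the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.poisson(0.3, size=(6, 12)).astype(float)  # toy sparse n x d count matrix

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
doc_embedding = U[:, :k] * s[:k]  # n x k document representation in topic space
term_embedding = Vt[:k, :].T      # d x k term representation in topic space

D_k = doc_embedding @ Vt[:k, :]   # best rank-k approximation of D
print(np.linalg.norm(D - D_k))    # reconstruction error of the truncated SVD
```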

Topic Modeling: SVD on text

Latent Semantic Analysis

[Figure: the n × d document-term matrix D factorizes as D ≈ Lk Δk Rk⊺, with Lk the n × k documents × topics matrix (k document basis vectors), Δk the k × k diagonal matrix of topic importances, and Rk⊺ the k × d topics × words matrix.]

Topic Modeling: Centering and sparsity

Latent Semantic Analysis

No mean centering is used.

The results are approximately the same as for PCA because of the sparsity of D: the sparsity implies that most of the entries are zero, and that the mean is much smaller than the non-zero entries. In such scenarios, it can be shown that the covariance matrix is approximately proportional to D⊺D.

The sparsity of the data also results in a low intrinsic dimensionality.

The dimensionality reduction effect of LSA is rather drastic: often, a corpus represented on a lexicon of 100,000 dimensions can be summarized in fewer than 300 dimensions.

LSA is also a classic example of how the "loss" of information from discarding some dimensions can actually result in an improvement in the quality of the data representation.

Topic Modeling: Synonymy and polysemy

Latent Semantic Analysis

Synonymy refers to the fact that two words can have the same meaning, e.g. comical and hilarious.

Polysemy refers to the fact that the same word can have two different meanings, e.g. jaguar.

Typically the meaning of a word can be understood from its context, but term frequencies do not capture the context sufficiently; e.g. two documents containing the words comical and hilarious, respectively, may not be deemed sufficiently similar.

The truncated representation after LSA typically removes the noise effects of synonymy and polysemy because the singular vectors represent the directions of correlation in the data.

Topic Modeling

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic variant of LSA and SVD.

It is an expectation-maximization-based modeling algorithm.

Its goal is to discover the correlation structure of the words, not of the documents (or data objects).

Topic Modeling

Probabilistic Latent Semantic Analysis

[Figure: PLSA as a matrix factorization — the n × d document-term matrix D = [P(doci, wordj)] factorizes as Lk Δk Rk⊺, with Lk = [P(doci ∣ topicm)], the diagonal of Δk holding the prior probabilities P(topicm), and Rk⊺ = [P(wordi ∣ topicm)].]

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the generative process is inherently designed for dimensionality reduction rather than clustering, and different parts of the same document can be generated by different mixture components.

It is assumed that there are k aspects (or latent topics) denoted by G1, …, Gk.

The generative process builds the document-term matrix as follows:

1 Select a latent component (aspect) Gm with probability P(Gm).
2 Generate the indices (i, j) of a document-word pair (Di, wj) with probabilities P(Di ∣ Gm) and P(wj ∣ Gm), respectively. Increment the frequency of entry (i, j) in the document-term matrix by 1. The document and word indices are generated independently of each other.

All the parameters of this generative process, such as P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm), need to be estimated from the observed frequencies in the n × d document-term matrix.

Topic Modeling

Probabilistic Latent Semantic Analysis

An important assumption in PLSA is that the selected documents and words are conditionally independent after the latent topical component Gm has been fixed:

P(Di, wj ∣ Gm) = P(Di ∣ Gm) P(wj ∣ Gm) (6)

This implies that the joint probability P(Di, wj) of selecting a document-word pair can be expressed in the following way:

P(Di, wj) = Σ_{m=1}^{k} P(Gm) P(Di, wj ∣ Gm) = Σ_{m=1}^{k} P(Gm) P(Di ∣ Gm) P(wj ∣ Gm) (7)

Local independence between documents and words does not imply global independence.

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the posterior probability P(Gm ∣ Di, wj) of the latent component associated with a particular document-word pair is estimated.

The EM algorithm starts by initializing P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm) to 1/k, 1/n, and 1/d, respectively.

Here, k is the number of aspects (clusters), n the number of documents, and d the number of words.

Topic Modeling

Probabilistic Latent Semantic Analysis

The algorithm iteratively executes the following E- and M-steps until convergence:

1 (E-step) Estimate the posterior probability P(Gm ∣ Di, wj) in terms of P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm).
2 (M-step) Estimate P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm) in terms of the posterior probability P(Gm ∣ Di, wj) and the observed word-document co-occurrence data, using log-likelihood maximization.

Topic Modeling

Probabilistic Latent Semantic Analysis - E-step

The posterior probability estimated in the E-step can be expanded using Bayes' rule:

P(Gm ∣ Di, wj) = P(Gm) P(Di, wj ∣ Gm) / P(Di, wj) (8)

Expanding the numerator via (6) and the denominator via (7), we obtain

P(Gm ∣ Di, wj) = P(Gm) P(Di ∣ Gm) P(wj ∣ Gm) / Σ_{r=1}^{k} P(Gr) P(Di ∣ Gr) P(wj ∣ Gr) (9)

This shows that the E-step can be implemented in terms of P(Gm), P(Di ∣ Gm), and P(wj ∣ Gm).

Topic Modeling

Probabilistic Latent Semantic Analysis - M-step

P(Gm ∣ Di, wj) may be viewed as a weight attached to each word-document co-occurrence pair for each aspect Gm.

These weights can be used to estimate P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm) via the following update rules (shown without proof):

P(Di ∣ Gm) ∝ Σ_{wj} f(Di, wj) P(Gm ∣ Di, wj)  ∀i ∈ {1, …, n}, m ∈ {1, …, k} (10)

P(wj ∣ Gm) ∝ Σ_{Di} f(Di, wj) P(Gm ∣ Di, wj)  ∀j ∈ {1, …, d}, m ∈ {1, …, k} (11)

P(Gm) ∝ Σ_{Di} Σ_{wj} f(Di, wj) P(Gm ∣ Di, wj)  ∀m ∈ {1, …, k} (12)

f(Di, wj) is the observed frequency of word wj in document Di.
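
The updates translate directly into a few lines of numpy; a sketch on a toy frequency matrix (note the small perturbation of the uniform initialization — an addition not on the slides — since the exactly uniform start is a fixed point of EM):

```python
import numpy as np

# PLSA via EM on a toy n x d frequency matrix F, with F[i, j] = f(D_i, w_j).
rng = np.random.default_rng(0)
F = np.array([[4, 2, 0, 0],
              [3, 1, 0, 1],
              [0, 0, 5, 2],
              [0, 1, 3, 3]], dtype=float)
n, d, k = F.shape[0], F.shape[1], 2

P_G = np.full(k, 1 / k)                                        # P(G_m)
P_D = np.full((n, k), 1 / n) + rng.uniform(0, 1e-2, (n, k))    # P(D_i | G_m)
P_w = np.full((d, k), 1 / d) + rng.uniform(0, 1e-2, (d, k))    # P(w_j | G_m)

for _ in range(100):
    # E-step (9): posterior P(G_m | D_i, w_j), shape (n, d, k)
    joint = P_G * P_D[:, None, :] * P_w[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step (10)-(12): weighted co-occurrence counts, then normalize
    W = F[:, :, None] * post
    P_D = W.sum(axis=1) / W.sum(axis=(0, 1))
    P_w = W.sum(axis=0) / W.sum(axis=(0, 1))
    P_G = W.sum(axis=(0, 1)) / F.sum()

print(P_G.round(3))  # aspect priors
print(P_w.round(2))  # topical words per aspect
```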

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

The three key sets of parameters estimated in the M-step are P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm).

These sets of parameters provide an SVD-like matrix factorization of the n × d document-term matrix D.

Assume D is scaled such that its entries sum to an aggregate probability of 1.

Then the (i, j)th entry of D can be viewed as an observed instantiation of the probabilistic quantity P(Di, wj).

Let Lk be the n × k matrix whose (i, m)th entry is P(Di ∣ Gm).

Let Δk be the k × k diagonal matrix whose mth diagonal entry is P(Gm).

Let Rk be the d × k matrix whose (j, m)th entry is P(wj ∣ Gm).

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

Then the (i, j)th entry P(Di, wj) of the matrix D can be expressed in terms of the entries of the aforementioned matrices according to (7), which is replicated here:

P(Di, wj) = Σ_{m=1}^{k} P(Gm) P(Di, wj ∣ Gm) = Σ_{m=1}^{k} P(Gm) P(Di ∣ Gm) P(wj ∣ Gm) (13)

The left-hand side is equal to entry (i, j) of D.

The right-hand side is equal to entry (i, j) of Lk Δk Rk⊺.

Depending on the number of components k, the right-hand side can only approximate the matrix D; this rank-k approximation is denoted by Dk.

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

In matrix notation, we then have: Dk = Lk Δk Rk⊺.

The transformed representation in k-dimensional space is Lk Δk.

The transformed representations will differ between PLSA and LSA: LSA optimizes the mean-squared error, whereas PLSA maximizes the log-likelihood fit to a probabilistic generative model.

Both representations capture synonymy and polysemy.

In PLSA, unlike LSA, the columns of Rk are non-negative and have a clear probabilistic meaning. They allow one to infer the topical words of the corresponding aspects.

In LSA, unlike PLSA, the transformation can be interpreted in terms of a rotation of an orthonormal axis system, which can also be applied to out-of-sample documents.

Topic Modeling

Probabilistic Latent Semantic Analysis - Limitations

Although the PLSA method is an intuitively sound model for probabilistic modeling, it has a number of practical drawbacks.

The number of parameters grows linearly with the number of documents. Therefore, such an approach can be slow and may overfit the training data because of the large number of estimated parameters.

Furthermore, while PLSA provides a generative model of the document-word pairs in the training data, it cannot easily assign probabilities to previously unseen documents.

In contrast, other EM-based models, such as Latent Dirichlet Allocation, transfer to unseen documents as well.

2.3 Transduction

based on:

Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines. ICML 1999: 200-209. Source of the figures in this section.

Transduction

Known test set

Classification on text databases often means that we know all the data we will work with before training.

Hence the test set is known a priori.

This setting is called 'transductive'.

Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)

Trains SVM on both training and test set

Uses test data to maximise margin

Transduction

Inductive vs. Transductive Classification

Task: predict label y from features x

Classic inductive setting

Strategy: Learn classifier on (labelled) training data

Goal: Classifier shall generalise to unseen data from the same distribution

Transductive setting

Strategy: Learn classifier on (labelled) training data AND a given (unlabelled) test dataset

Goal: Predict class labels for this particular dataset

Transduction

Why transduction?

Classic approach works: train on training dataset, test on test dataset

That is what we usually do in practice, for instance, in cross-validation.

We usually ignore or neglect the fact that these settings are transductive.

The benefits of transductive classification

Inductive setting: infinitely many potential classifiers

Transductive setting: finite number of equivalence classes of classifiers

f and f′ in the same equivalence class ⇔ f and f′ classify all points from the training and test datasets identically

Transductive SVM

Learning-theoretic argument

Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes)

Theorem by Vapnik (1998): The larger the margin, the lower the number of equivalence classes that contain a classifier with this margin

Find hyperplane that separates classes in training data AND in test data with maximummargin.

Transductive SVM

[Figure: binary document-term matrix for documents D1–D6 over the terms 'salt', 'basil', 'parsley', 'atom', 'physics', 'nuclear'.]

Transductive SVM

Linearly separable case

min_{w, b, y*} (1/2) ∥w∥²

s.t. yi [w⊺xi + b] ≥ 1 for all i = 1, …, n

     y*j [w⊺x*j + b] ≥ 1 for all j = 1, …, k

Transductive SVM

Non-linearly separable case

min_{w, b, y*, ξ, ξ*} (1/2) ∥w∥² + C Σ_{i=1}^{n} ξi + C* Σ_{j=1}^{k} ξ*j

s.t. yi [w⊺xi + b] ≥ 1 − ξi for all i = 1, …, n

     y*j [w⊺x*j + b] ≥ 1 − ξ*j for all j = 1, …, k

     ξi ≥ 0 for all i, and ξ*j ≥ 0 for all j

Transductive SVM

Optimisation

How to solve this OP?

Not so ‘nice’: combination of integer and convex OP

Joachims’ approach: find approximate solution by iterative application of inductive SVM

1 Train an inductive SVM on the training data, predict on the test data, and assign these labels to the test data.
2 Retrain on all data, with special slack weights (C*−, C*+) for the test data.
3 Outer loop: Repeat and slowly increase (C*−, C*+).
4 Inner loop: Within each repetition, switch pairs of 'misclassified' test points repeatedly.

Local search with approximate solution to OP
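
A much-simplified sketch of this procedure (assumptions: scikit-learn's SVC, a single test-slack weight C* grown linearly, and no pair-switching inner loop or class-ratio constraint, all of which Joachims' full algorithm has):

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_train, y_train, X_test, C=1.0, C_star_max=1.0, steps=10):
    # Step 1: inductive SVM on the training data, initial labels for the test data.
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    y_star = clf.predict(X_test)
    X_all = np.vstack([X_train, X_test])
    # Outer loop: retrain on all data while slowly increasing the test-point weight.
    for C_star in np.linspace(C_star_max / steps, C_star_max, steps):
        weights = np.r_[np.ones(len(X_train)),            # cost C for training points
                        np.full(len(X_test), C_star / C)]  # growing cost for test points
        clf = SVC(kernel="linear", C=C).fit(X_all, np.r_[y_train, y_star],
                                            sample_weight=weights)
        y_star = clf.predict(X_test)                       # re-assign test labels
    return clf, y_star
```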

Transductive SVM: Optimization

Variant of inductive SVM

min_{w, b, y*, ξ, ξ*} (1/2) ∥w∥² + C Σ_{i=1}^{n} ξi + C*− Σ_{j: y*j = −1} ξ*j + C*+ Σ_{j: y*j = +1} ξ*j

s.t. yi [w⊺xi + b] ≥ 1 − ξi for all i = 1, …, n

     y*j [w⊺x*j + b] ≥ 1 − ξ*j for all j = 1, …, k

Three different penalty costs

C for points from the training dataset

C*− for points from the test dataset currently in class −1

C*+ for points from the test dataset currently in class +1

Transductive SVM: Optimisation

[Figure: train, predict → re-train → re-predict; slack variables ξi on labelled training points and ξ*j on pseudo-labelled test points. From Joachims, 1999.]

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test set size of 3,299

[Figure: average P/R-breakeven point vs. number of training examples (17–9,603), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

[Figure: average P/R-breakeven point vs. number of test examples (206–3,299), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category 'course' for different training set sizes

[Figure: P/R-breakeven point (class 'course') vs. number of training examples (9–226), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category 'project' for different training set sizes

[Figure: P/R-breakeven point (class 'project') vs. number of training examples (9–226), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Summary

Results

Transductive version of SVM

Maximizes margin on training and test data

Implementation uses variant of classic inductive SVM

Solution is approximate and fast

Works well on text, in particular on small training samples and large test sets

2.4 Cotraining

based on:

Avrim Blum, Tom M. Mitchell, Combining Labeled and Unlabeled Data with Co-Training. COLT 1998: 92-100

Cotraining

Goals

To understand that hyperlinks define a second view of documents.

To understand that this view can be used to infer class labels for an augmented training dataset and improved prediction accuracy.

To understand how this concept of cotraining generalizes to other domains.

Cotraining

Motivation

In text mining: Besides their content in the form of words, texts nowadays carry hyperlinks that point to related pages. Can this second type of information on a website be used to improve classification?

In general: How can classification be improved if there is plenty of unlabeled data in the form of a second view of the data?

Yes: the second view can be used to infer class labels of unlabeled data points, to augment the training dataset.

Cotraining

Classic cotraining algorithm

Blum and Mitchell's cotraining uses two classifiers, trained on separate views of the data, to create pseudo-labels for those unlabeled data points for which the predictors are most confident about their predictions.

The pseudo-labels are then used to retrain the classifiers, before repeating the pseudo-label generation.

The entire process is repeated for k iterations.
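
A compact sketch of this loop (assumptions: two Naive Bayes classifiers on count features, and the same number of pseudo-labels per classifier per iteration, rather than the p positive and n negative examples drawn from a pool U′ as in the original algorithm):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, X1_u, X2_u, iterations=30, per_iter=2):
    # X1/X2: two views of the labelled data; X1_u/X2_u: views of the unlabelled data.
    for _ in range(iterations):
        if len(X1_u) == 0:
            break
        h1 = MultinomialNB().fit(X1, y)
        h2 = MultinomialNB().fit(X2, y)
        # Each classifier proposes the unlabelled points it is most confident about.
        conf1 = h1.predict_proba(X1_u).max(axis=1)
        conf2 = h2.predict_proba(X2_u).max(axis=1)
        pick = np.unique(np.r_[np.argsort(conf1)[-per_iter:],
                               np.argsort(conf2)[-per_iter:]])
        # Pseudo-label each picked point with the more confident classifier.
        pseudo = np.where(conf1[pick] >= conf2[pick],
                          h1.predict(X1_u[pick]), h2.predict(X2_u[pick]))
        X1, X2 = np.vstack([X1, X1_u[pick]]), np.vstack([X2, X2_u[pick]])
        y = np.r_[y, pseudo]
        keep = np.setdiff1d(np.arange(len(X1_u)), pick)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return MultinomialNB().fit(X1, y), MultinomialNB().fit(X2, y)
```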

Cotraining

[Figure: cotraining data flow — a labelled set L with two views (x1, x2) is used to train classifiers h1 and h2; a sample U′ of the unlabelled pool U is classified by both, and the confidently labelled points are added to the labelled set.]

Cotraining: Pseudocode

[Pseudocode figure omitted. Source: Blum and Mitchell, 1998]

Cotraining

Why can unlabeled data help at all?

Assume an instance space X = X1 × X2, where X1 and X2 are different views of the data.

Each view is assumed to be sufficient for correct classification.

Let D be a distribution over X, and let C1 and C2 be concept classes defined over X1 and X2, respectively.

We assume that all labels on examples with non-zero probability under D are consistent with some target functions f1 ∈ C1 and f2 ∈ C2.

If f denotes the combined target concept over the entire example, then for any example x = (x1, x2) observed with label l, we have f(x) = f1(x1) = f2(x2) = l.

This means that D assigns probability zero to any example (x1, x2) such that f1(x1) ≠ f2(x2).

Cotraining

Why can unlabeled data help at all?

For a given D over X, we define f = (f1, f2) ∈ C1 × C2 as being compatible with D if it satisfies the condition that D assigns zero probability to the set of examples (x1, x2) such that f1(x1) ≠ f2(x2).

The set of compatible target functions is typically much simpler and smaller than the entire concept class it is drawn from.

As in the transductive SVM, a reduction in the equivalence classes of the target functions leads to an improved bound on the test error!

Cotraining: Graph Representation of Key Idea

[Figure omitted. Source: Blum and Mitchell, 1998]

Cotraining

When does the specific approach of Blum and Mitchell work?

Cotraining was shown to work if

the two views X1 and X2 are both sufficient to learn the target function,
the two views are conditionally independent given the class label: P(X1 ∣ Y) ⊥ P(X2 ∣ Y).

Cotraining

Error reduction with training dataset augmentation (Naive Bayes, p = 1, n = 3, k = 30, u = 75). Source: Blum and Mitchell, 1998

[Figure: percent error on test data vs. number of co-training iterations (0–40) for the hyperlink-based and page-based classifiers, compared to the default error.]

Cotraining

Summary

Cotraining is a mechanism to augment the labeled training dataset when two data views are available.

In the original work by Blum and Mitchell (1998), cotraining was shown to work if the two views are independent given the class and each view is sufficient for learning the target concept.
