
2. Text Mining

Text Mining

Goals

To learn key problems and techniques in mining one of the most common types of data

To learn how to represent text numerically

To learn how to make use of enormous amounts of unlabeled data

To learn how to find co-occurring keywords in documents

2.1 Basics of Text Representation and Analysis

based on:

Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13

What is text mining?

Definition

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation

Most knowledge is stored in the form of text, both in industry and in academia.

This alone makes text mining an integral part of knowledge discovery!

Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

Why text mining?

Text data is growing in an unprecedented manner

Digital libraries

Web and Web-enabled applications (e.g. social networks)

Newswire services

Text mining terminology

Important definitions

A set of features of text is also referred to as a lexicon.

A document can be viewed either as a sequence or as a multidimensional record.

A collection of documents is referred to as a corpus.

Text mining terminology

A number of special characteristics of text data

Very sparse

Diverse length

Nonnegative statistics

Side information is often available, e.g. hyperlinks, metadata

Lots of unlabeled data

What is text mining?

Common tasks

Information retrieval: Find documents that are relevant to a user, or to a query, in a collection of documents

Document ranking: rank all documents in the collection
Document selection: classify documents into relevant and irrelevant

Information filtering: Search newly created documents for information that is relevant to a user

Document classification: Assign a document to a category that describes its content

Keyword co-occurrence: Find groups of keywords that co-occur in many documents

Evaluating text mining

Precision and Recall

Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

The precision is the percentage of retrieved documents that are relevant to the query:

precision = ∣{Relevant} ∩ {Retrieved}∣ / ∣{Retrieved}∣ (1)

The recall is the percentage of relevant documents that were retrieved by the query:

recall = ∣{Relevant} ∩ {Retrieved}∣ / ∣{Relevant}∣ (2)
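
To make the two measures concrete, here is a minimal Python sketch with hypothetical document IDs:

```python
# Precision and recall from the two document sets (toy example).
relevant = {"d1", "d2", "d3", "d5"}     # {Relevant}: hypothetical relevant documents
retrieved = {"d2", "d3", "d4", "d6"}    # {Retrieved}: hypothetical retrieved documents

hits = relevant & retrieved             # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)  # fraction of retrieved documents that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant documents that were retrieved

print(precision, recall)                # 0.5 0.5
```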

Text representation

Tokenization

Tokenization is the process of identifying keywords in a document.

Not all words in a text are relevant.

Text mining ignores stop words.

Stop words form the stop list.

Stop lists are context-dependent.
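
A minimal sketch of tokenization with a toy, context-dependent stop list:

```python
# Whitespace tokenization followed by stop-word removal (toy stop list).
stop_list = {"the", "of", "in", "a", "is", "and"}

def tokenize(text):
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_list]

print(tokenize("The mining of text is a key problem in data mining"))
# ['mining', 'text', 'key', 'problem', 'data', 'mining']
```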

Text representation

Vector space model

Given #d documents and #t terms.

Model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix

Matrix TF of size #d × #t

Entries measure the association of a term and a document

If a term t does not occur in a document d, then TF(d, t) = 0.

If a term t does occur in a document d, then TF(d, t) > 0.
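
A sketch of building this term-frequency matrix; scikit-learn's CountVectorizer (one possible choice, any tokenizer works) returns exactly such a #d × #t count matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["salt basil parsley salt",      # toy corpus of #d = 3 documents
        "atom physics nuclear",
        "parsley atom"]

vectorizer = CountVectorizer()
TF = vectorizer.fit_transform(docs)     # sparse #d x #t matrix of raw term counts

print(vectorizer.get_feature_names_out())  # the lexicon (terms)
print(TF.toarray())                        # TF(d, t) = 0 where t does not occur in d
```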

Text representation

Definitions of term frequency

If term t occurs in document d, then possible definitions include:

TF(d, t) = 1

TF(d, t) = freq(d, t), the raw frequency of t in d

TF(d, t) = freq(d, t) / Σ_{t′∈T} freq(d, t′)

TF(d, t) = 1 + log(freq(d, t)) if freq(d, t) > 0, and TF(d, t) = 0 otherwise

Text representation

Inverse document frequency

The inverse document frequency (IDF) represents the scaling factor, or importance, of a term.

A term that appears in many documents is scaled down:

IDF(t) = log( (1 + ∣d∣) / ∣dt∣ ), (3)

where ∣d∣ is the number of all documents, and ∣dt∣ is the number of documents containing term t.

Text representation

TF-IDF measure

The TF-IDF measure is the product of term frequency and inverse document frequency:

TF-IDF(d, t) = TF(d, t) · IDF(t). (4)
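
A minimal sketch combining the log-scaled term frequency from above with (3) and (4):

```python
import math

docs = [["salt", "basil", "salt"], ["atom", "physics"], ["salt", "atom"]]  # tokenized toy docs

def tf(doc, t):
    freq = doc.count(t)                            # freq(d, t)
    return 1 + math.log(freq) if freq > 0 else 0.0

def idf(t):
    n_docs = len(docs)                             # |d|: number of all documents
    n_containing = sum(t in doc for doc in docs)   # |d_t|: documents containing t
    return math.log((1 + n_docs) / n_containing)

def tf_idf(doc, t):
    return tf(doc, t) * idf(t)                     # TF-IDF(d, t) = TF(d, t) * IDF(t)

print(tf_idf(docs[0], "salt"))  # ~1.17: frequent in doc 0, but present in 2 of 3 docs
```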

Measuring similarity

Cosine measure

Let v1 and v2 be two document vectors.

The cosine similarity is defined as

sim(v1, v2) = v1⊺v2 / (∣v1∣ ∣v2∣). (5)
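
In code, for two (e.g. TF-IDF-weighted) document vectors — a minimal numpy sketch:

```python
import numpy as np

# Cosine similarity of two document vectors.
v1 = np.array([1.7, 0.0, 0.7])
v2 = np.array([1.0, 0.5, 0.0])

sim = (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(sim)  # 1.0 for parallel vectors; 0.0 for documents with no shared terms
```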

Kernels

Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:

vectorial representation: vector kernels such as the linear, polynomial, or Gaussian RBF kernel
one long string: string kernels that count common k-mers in two strings (see the sketch below)
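
A minimal sketch of such a string kernel (the k-mer spectrum kernel, in a simple assumed variant):

```python
from collections import Counter

# Spectrum kernel: inner product of the k-mer count vectors of two strings.
def spectrum_kernel(s1, s2, k=3):
    kmers1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    kmers2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(count * kmers2[kmer] for kmer, count in kmers1.items())

print(spectrum_kernel("text mining", "data mining"))  # 5 shared 3-mers (from " mining")
```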

2.2 Topic Modeling

based on:

Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Sections 2.4.4.3 and 13.4

Topic Modeling

Definition

Topic modeling can be viewed as a probabilistic version of latent semantic analysis (LSA).

Its most basic version is referred to as Probabilistic Latent Semantic Analysis (PLSA).

It provides an alternative method for performing dimensionality reduction and has several advantages over LSA.

Topic Modeling: SVD on text

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an application of SVD to the text domain.

The goal is to retrieve a vectorial representation of terms and documents.

The data matrix D is an n × d document-term matrix containing the word frequencies in the n documents, where d is the size of the lexicon.
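
A minimal numpy sketch of LSA as a rank-k truncated SVD of D (toy data; the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.poisson(0.3, size=(6, 12)).astype(float)  # toy sparse n x d count matrix

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
doc_embedding = U[:, :k] * s[:k]  # n x k document representation in topic space
term_embedding = Vt[:k, :].T      # d x k term representation in topic space

D_k = doc_embedding @ Vt[:k, :]   # best rank-k approximation of D
print(np.linalg.norm(D - D_k))    # reconstruction error of the truncated SVD
```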

Topic Modeling: SVD on text

Latent Semantic Analysis

[Figure: the n × d document-term matrix D factorizes as D ≈ Lk Δk Rk⊺, with Lk the n × k documents × topics matrix (k document basis vectors), Δk the k × k diagonal matrix of topic importances, and Rk⊺ the k × d topics × words matrix.]

Topic Modeling: Centering and sparsity

Latent Semantic Analysis

No mean centering is used.

The results are approximately the same as for PCA because of the sparsity of D: the sparsity implies that most of the entries are zero, and that the mean is much smaller than the non-zero entries. In such scenarios, it can be shown that the covariance matrix is approximately proportional to D⊺D.

The sparsity of the data also results in a low intrinsic dimensionality.

The dimensionality reduction effect of LSA is rather drastic: often, a corpus represented on a lexicon of 100,000 dimensions can be summarized in fewer than 300 dimensions.

LSA is also a classic example of how the "loss" of information from discarding some dimensions can actually result in an improvement in the quality of the data representation.

Topic Modeling: Synonymy and polysemy

Latent Semantic Analysis

Synonymy refers to the fact that two words can have the same meaning, e.g. comical and hilarious.

Polysemy refers to the fact that the same word can have two different meanings, e.g. jaguar.

Typically the meaning of a word can be understood from its context, but term frequencies do not capture the context sufficiently; e.g. two documents containing the words comical and hilarious, respectively, may not be deemed sufficiently similar.

The truncated representation after LSA typically removes the noise effects of synonymy and polysemy because the singular vectors represent the directions of correlation in the data.

Topic Modeling

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic variant of LSA and SVD.

It is an expectation-maximization-based modeling algorithm.

Its goal is to discover the correlation structure of the words, not of the documents (or data objects).

Topic Modeling

Probabilistic Latent Semantic Analysis

[Figure: PLSA as a matrix factorization — the n × d document-term matrix D = [P(doci, wordj)] factorizes as Lk Δk Rk⊺, with Lk = [P(doci ∣ topicm)], the diagonal of Δk holding the prior probabilities P(topicm), and Rk⊺ = [P(wordi ∣ topicm)].]

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the generative process is inherently designed for dimensionality reduction rather than clustering, and different parts of the same document can be generated by different mixture components.

It is assumed that there are k aspects (or latent topics) denoted by G1, …, Gk.

The generative process builds the document-term matrix as follows:

1 Select a latent component (aspect) Gm with probability P(Gm).
2 Generate the indices (i, j) of a document-word pair (Di, wj) with probabilities P(Di ∣ Gm) and P(wj ∣ Gm), respectively. Increment the frequency of entry (i, j) in the document-term matrix by 1. The document and word indices are generated independently of each other.

All the parameters of this generative process, such as P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm), need to be estimated from the observed frequencies in the n × d document-term matrix.

Topic Modeling

Probabilistic Latent Semantic Analysis

An important assumption in PLSA is that the selected documents and words are conditionally independent after the latent topical component Gm has been fixed:

P(Di, wj ∣ Gm) = P(Di ∣ Gm) P(wj ∣ Gm) (6)

This implies that the joint probability P(Di, wj) of selecting a document-word pair can be expressed in the following way:

P(Di, wj) = Σ_{m=1}^{k} P(Gm) P(Di, wj ∣ Gm) = Σ_{m=1}^{k} P(Gm) P(Di ∣ Gm) P(wj ∣ Gm) (7)

Local independence between documents and words does not imply global independence.

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the posterior probability P(Gm ∣ Di, wj) of the latent component associated with a particular document-word pair is estimated.

The EM algorithm starts by initializing P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm) to 1/k, 1/n, and 1/d, respectively.

Here, k is the number of aspects (clusters), n the number of documents, and d the number of words.

Topic Modeling

Probabilistic Latent Semantic Analysis

The algorithm iteratively executes the following E- and M-steps until convergence:

1 (E-step) Estimate the posterior probability P(Gm ∣ Di, wj) in terms of P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm).
2 (M-step) Estimate P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm) in terms of the posterior probability P(Gm ∣ Di, wj) and the observed word-document co-occurrence data, using log-likelihood maximization.

Topic Modeling

Probabilistic Latent Semantic Analysis - E-step

The posterior probability estimated in the E-step can be expanded using Bayes' rule:

P(Gm ∣ Di, wj) = P(Gm) P(Di, wj ∣ Gm) / P(Di, wj) (8)

Expanding the numerator via (6) and the denominator via (7), we obtain

P(Gm ∣ Di, wj) = P(Gm) P(Di ∣ Gm) P(wj ∣ Gm) / Σ_{r=1}^{k} P(Gr) P(Di ∣ Gr) P(wj ∣ Gr) (9)

This shows that the E-step can be implemented in terms of P(Gm), P(Di ∣ Gm), and P(wj ∣ Gm).

Topic Modeling

Probabilistic Latent Semantic Analysis - M-step

P(Gm ∣ Di, wj) may be viewed as a weight attached to each word-document co-occurrence pair for each aspect Gm.

These weights can be used to estimate P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm) via the following update rules (shown without proof):

P(Di ∣ Gm) ∝ Σ_{wj} f(Di, wj) P(Gm ∣ Di, wj)  ∀i ∈ {1, …, n}, m ∈ {1, …, k} (10)

P(wj ∣ Gm) ∝ Σ_{Di} f(Di, wj) P(Gm ∣ Di, wj)  ∀j ∈ {1, …, d}, m ∈ {1, …, k} (11)

P(Gm) ∝ Σ_{Di} Σ_{wj} f(Di, wj) P(Gm ∣ Di, wj)  ∀m ∈ {1, …, k} (12)

f(Di, wj) is the observed frequency of word wj in document Di.
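
The updates translate directly into a few lines of numpy; a sketch on a toy frequency matrix (note the small perturbation of the uniform initialization — an addition not on the slides — since the exactly uniform start is a fixed point of EM):

```python
import numpy as np

# PLSA via EM on a toy n x d frequency matrix F, with F[i, j] = f(D_i, w_j).
rng = np.random.default_rng(0)
F = np.array([[4, 2, 0, 0],
              [3, 1, 0, 1],
              [0, 0, 5, 2],
              [0, 1, 3, 3]], dtype=float)
n, d, k = F.shape[0], F.shape[1], 2

P_G = np.full(k, 1 / k)                                        # P(G_m)
P_D = np.full((n, k), 1 / n) + rng.uniform(0, 1e-2, (n, k))    # P(D_i | G_m)
P_w = np.full((d, k), 1 / d) + rng.uniform(0, 1e-2, (d, k))    # P(w_j | G_m)

for _ in range(100):
    # E-step (9): posterior P(G_m | D_i, w_j), shape (n, d, k)
    joint = P_G * P_D[:, None, :] * P_w[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step (10)-(12): weighted co-occurrence counts, then normalize
    W = F[:, :, None] * post
    P_D = W.sum(axis=1) / W.sum(axis=(0, 1))
    P_w = W.sum(axis=0) / W.sum(axis=(0, 1))
    P_G = W.sum(axis=(0, 1)) / F.sum()

print(P_G.round(3))  # aspect priors
print(P_w.round(2))  # topical words per aspect
```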

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

The three key sets of parameters estimated in the M-step are P(Gm), P(Di ∣ Gm) and P(wj ∣ Gm).

These sets of parameters provide an SVD-like matrix factorization of the n × d document-term matrix D.

Assume D is scaled such that its entries sum to an aggregate probability of 1.

Then the (i, j)th entry of D can be viewed as an observed instantiation of the probabilistic quantity P(Di, wj).

Let Lk be the n × k matrix whose (i, m)th entry is P(Di ∣ Gm).

Let Δk be the k × k diagonal matrix whose mth diagonal entry is P(Gm).

Let Rk be the d × k matrix whose (j, m)th entry is P(wj ∣ Gm).

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

Then the (i, j)th entry P(Di, wj) of the matrix D can be expressed in terms of the entries of the aforementioned matrices according to (7), which is replicated here:

P(Di, wj) = Σ_{m=1}^{k} P(Gm) P(Di, wj ∣ Gm) = Σ_{m=1}^{k} P(Gm) P(Di ∣ Gm) P(wj ∣ Gm) (13)

The left-hand side is equal to entry (i, j) of D.

The right-hand side is equal to entry (i, j) of Lk Δk Rk⊺.

Depending on the number of components k, the right-hand side can only approximate the matrix D; this rank-k approximation is denoted by Dk.

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

In matrix notation, we then have: Dk = Lk Δk Rk⊺.

The transformed representation in k-dimensional space is Lk Δk.

The transformed representations will differ between PLSA and LSA: LSA optimizes the mean-squared error, whereas PLSA maximizes the log-likelihood fit to a probabilistic generative model.

Both representations capture synonymy and polysemy.

In PLSA, unlike LSA, the columns of Rk are non-negative and have a clear probabilistic meaning. They allow one to infer the topical words of the corresponding aspects.

In LSA, unlike PLSA, the transformation can be interpreted in terms of a rotation of an orthonormal axis system, which can also be applied to out-of-sample documents.

Topic Modeling

Probabilistic Latent Semantic Analysis - Limitations

Although the PLSA method is an intuitively sound model for probabilistic modeling, it has a number of practical drawbacks.

The number of parameters grows linearly with the number of documents. Therefore, such an approach can be slow and may overfit the training data because of the large number of estimated parameters.

Furthermore, while PLSA provides a generative model of the document-word pairs in the training data, it cannot easily assign probabilities to previously unseen documents.

In contrast, other EM-based models, such as Latent Dirichlet Allocation, transfer to unseen documents as well.

2.3 Transduction

based on:

Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines. ICML 1999: 200-209. Source of the figures in this section.

Transduction

Known test set

Classification on text databases often means that we know all the data we will work with before training.

Hence the test set is known a priori.

This setting is called 'transductive'.

Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)

Trains SVM on both training and test set

Uses test data to maximise margin

Transduction

Inductive vs. Transductive Classification

Task: predict label y from features x

Classic inductive setting

Strategy: Learn classifier on (labelled) training data

Goal: Classifier shall generalise to unseen data from the same distribution

Transductive setting

Strategy: Learn classifier on (labelled) training data AND a given (unlabelled) test dataset

Goal: Predict class labels for this particular dataset

Transduction

Why transduction?

Classic approach works: train on training dataset, test on test dataset

That is what we usually do in practice, for instance, in cross-validation.

We usually ignore or neglect the fact that these settings are transductive.

The benefits of transductive classification

Inductive setting: infinitely many potential classifiers

Transductive setting: finite number of equivalence classes of classifiers

f and f′ in the same equivalence class ⇔ f and f′ classify all points from the training and test datasets identically

Transductive SVM

Learning-theoretic argument

Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes)

Theorem by Vapnik (1998): The larger the margin, the lower the number of equivalence classes that contain a classifier with this margin

Find hyperplane that separates classes in training data AND in test data with maximummargin.

Transductive SVM

[Figure: binary document-term matrix for documents D1–D6 over the terms 'salt', 'basil', 'parsley', 'atom', 'physics', 'nuclear'.]

Transductive SVM

Linearly separable case

min_{w, b, y*} (1/2) ∥w∥²

s.t. yi [w⊺xi + b] ≥ 1 for all i = 1, …, n

     y*j [w⊺x*j + b] ≥ 1 for all j = 1, …, k

Transductive SVM

Non-linearly separable case

min_{w, b, y*, ξ, ξ*} (1/2) ∥w∥² + C Σ_{i=1}^{n} ξi + C* Σ_{j=1}^{k} ξ*j

s.t. yi [w⊺xi + b] ≥ 1 − ξi for all i = 1, …, n

     y*j [w⊺x*j + b] ≥ 1 − ξ*j for all j = 1, …, k

     ξi ≥ 0 for all i, and ξ*j ≥ 0 for all j

Transductive SVM

Optimisation

How to solve this OP?

Not so ‘nice’: combination of integer and convex OP

Joachims’ approach: find approximate solution by iterative application of inductive SVM

1 Train an inductive SVM on the training data, predict on the test data, and assign these labels to the test data.
2 Retrain on all data, with special slack weights (C*−, C*+) for the test data.
3 Outer loop: Repeat and slowly increase (C*−, C*+).
4 Inner loop: Within each repetition, switch pairs of 'misclassified' test points repeatedly.

Local search with approximate solution to OP
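
A much-simplified sketch of this procedure (assumptions: scikit-learn's SVC, a single test-slack weight C* grown linearly, and no pair-switching inner loop or class-ratio constraint, all of which Joachims' full algorithm has):

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_train, y_train, X_test, C=1.0, C_star_max=1.0, steps=10):
    # Step 1: inductive SVM on the training data, initial labels for the test data.
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    y_star = clf.predict(X_test)
    X_all = np.vstack([X_train, X_test])
    # Outer loop: retrain on all data while slowly increasing the test-point weight.
    for C_star in np.linspace(C_star_max / steps, C_star_max, steps):
        weights = np.r_[np.ones(len(X_train)),            # cost C for training points
                        np.full(len(X_test), C_star / C)]  # growing cost for test points
        clf = SVC(kernel="linear", C=C).fit(X_all, np.r_[y_train, y_star],
                                            sample_weight=weights)
        y_star = clf.predict(X_test)                       # re-assign test labels
    return clf, y_star
```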

Transductive SVM: Optimization

Variant of inductive SVM

min_{w, b, y*, ξ, ξ*} (1/2) ∥w∥² + C Σ_{i=1}^{n} ξi + C*− Σ_{j: y*j = −1} ξ*j + C*+ Σ_{j: y*j = +1} ξ*j

s.t. yi [w⊺xi + b] ≥ 1 − ξi for all i = 1, …, n

     y*j [w⊺x*j + b] ≥ 1 − ξ*j for all j = 1, …, k

Three different penalty costs

C for points from the training dataset

C*− for points from the test dataset currently in class −1

C*+ for points from the test dataset currently in class +1

Transductive SVM: Optimisation

[Figure: train, predict → re-train → re-predict; slack variables ξi on labelled training points and ξ*j on pseudo-labelled test points. From Joachims, 1999.]

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test set size of 3,299

[Figure: average P/R-breakeven point vs. number of training examples (17–9,603), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

[Figure: average P/R-breakeven point vs. number of test examples (206–3,299), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category 'course' for different training set sizes

[Figure: P/R-breakeven point (class 'course') vs. number of training examples (9–226), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category 'project' for different training set sizes

[Figure: P/R-breakeven point (class 'project') vs. number of training examples (9–226), for the Transductive SVM, the SVM, and Naive Bayes.]

Transductive SVM: Summary

Results

Transductive version of SVM

Maximizes margin on training and test data

Implementation uses variant of classic inductive SVM

Solution is approximate and fast

Works well on text, in particular on small training samples and large test sets

2.4 Cotraining

based on:

Avrim Blum, Tom M. Mitchell, Combining Labeled and Unlabeled Data with Co-Training. COLT 1998: 92-100

Cotraining

Goals

To understand that hyperlinks define a second view of documents.

To understand that this view can be used to infer class labels for an augmented training dataset and improved prediction accuracy.

To understand how this concept of cotraining generalizes to other domains.

Cotraining

Motivation

In text mining: Besides their content in the form of words, texts nowadays carry hyperlinks that point to related pages. Can this second type of information on a website be used to improve classification?

In general: How can classification be improved if there is plenty of unlabeled data in the form of a second view of the data?

Yes: the second view can be used to infer class labels of unlabeled data points, to augment the training dataset.

Cotraining

Classic cotraining algorithm

Blum and Mitchell's cotraining uses two classifiers, trained on separate views of the data, to create pseudo-labels for those unlabeled data points for which the predictors are most confident about their predictions.

The pseudo-labels are then used to retrain the classifiers, before repeating the pseudo-label generation.

The entire process is repeated for k iterations.
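
A compact sketch of this loop (assumptions: two Naive Bayes classifiers on count features, and the same number of pseudo-labels per classifier per iteration, rather than the p positive and n negative examples drawn from a pool U′ as in the original algorithm):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, X1_u, X2_u, iterations=30, per_iter=2):
    # X1/X2: two views of the labelled data; X1_u/X2_u: views of the unlabelled data.
    for _ in range(iterations):
        if len(X1_u) == 0:
            break
        h1 = MultinomialNB().fit(X1, y)
        h2 = MultinomialNB().fit(X2, y)
        # Each classifier proposes the unlabelled points it is most confident about.
        conf1 = h1.predict_proba(X1_u).max(axis=1)
        conf2 = h2.predict_proba(X2_u).max(axis=1)
        pick = np.unique(np.r_[np.argsort(conf1)[-per_iter:],
                               np.argsort(conf2)[-per_iter:]])
        # Pseudo-label each picked point with the more confident classifier.
        pseudo = np.where(conf1[pick] >= conf2[pick],
                          h1.predict(X1_u[pick]), h2.predict(X2_u[pick]))
        X1, X2 = np.vstack([X1, X1_u[pick]]), np.vstack([X2, X2_u[pick]])
        y = np.r_[y, pseudo]
        keep = np.setdiff1d(np.arange(len(X1_u)), pick)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return MultinomialNB().fit(X1, y), MultinomialNB().fit(X2, y)
```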

Cotraining

[Figure: cotraining data flow — a labelled set L with two views (x1, x2) is used to train classifiers h1 and h2; a sample U′ of the unlabelled pool U is classified by both, and the confidently labelled points are added to the labelled set.]

Cotraining: Pseudocode

[Pseudocode figure omitted. Source: Blum and Mitchell, 1998]

Cotraining

Why can unlabeled data help at all?

Assume an instance space X = X1 × X2, where X1 and X2 are different views of the data.

Each view is assumed to be sufficient for correct classification.

Let D be a distribution over X, and let C1 and C2 be concept classes defined over X1 and X2, respectively.

We assume that all labels on examples with non-zero probability under D are consistent with some target functions f1 ∈ C1 and f2 ∈ C2.

If f denotes the combined target concept over the entire example, then for any example x = (x1, x2) observed with label l, we have f(x) = f1(x1) = f2(x2) = l.

This means that D assigns probability zero to any example (x1, x2) such that f1(x1) ≠ f2(x2).

Cotraining

Why can unlabeled data help at all?

For a given D over X, we define f = (f1, f2) ∈ C1 × C2 as being compatible with D if it satisfies the condition that D assigns zero probability to the set of examples (x1, x2) such that f1(x1) ≠ f2(x2).

The set of compatible target functions is typically much simpler and smaller than the entire concept class it is drawn from.

As in the transductive SVM, a reduction in the equivalence classes of the target functions leads to an improved bound on the test error!

Cotraining: Graph Representation of Key Idea

[Figure omitted. Source: Blum and Mitchell, 1998]

Cotraining

When does the specific approach of Blum and Mitchell work?

Cotraining was shown to work if

the two views X1 and X2 are both sufficient to learn the target function,
the two views are conditionally independent given the class label: P(X1 ∣ Y) ⊥ P(X2 ∣ Y).

Cotraining

Error reduction with training dataset augmentation (Naive Bayes, p = 1, n = 3, k = 30, u = 75). Source: Blum and Mitchell, 1998

[Figure: percent error on test data vs. number of co-training iterations (0–40) for the hyperlink-based and page-based classifiers, compared to the default error.]

Cotraining

Summary

Cotraining is a mechanism to augment the labeled training dataset when two data views are available.

In the original work by Blum and Mitchell (1998), cotraining was shown to work if the two views are independent given the class and each view is sufficient for learning the target concept.
