
.: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics
Day 4: Text Mining

Karsten Borgwardt

February 21 to March 4, 2011

Machine Learning & Computational Biology Research Group, MPIs Tübingen

What is text mining?


Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation
Most knowledge is stored in terms of texts, both in industry and in academia.
This alone makes text mining an integral part of knowledge discovery!
Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

What is text mining?


Common tasks
Information retrieval: find documents that are relevant to a user, or to a query, in a collection of documents
Document ranking: rank all documents in the collection
Document selection: classify documents into relevant and irrelevant
Information filtering: search newly created documents for information that is relevant to a user
Document classification: assign a document to a category that describes its content
Keyword co-occurrence: find groups of keywords that co-occur in many documents

Evaluating text mining


Precision and Recall
Let the set of documents relevant to a query be denoted {Relevant} and the set of retrieved documents {Retrieved}.
The precision is the proportion of retrieved documents that are relevant to the query:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|    (1)

The recall is the proportion of relevant documents that were retrieved by the query:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|    (2)
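The two measures can be computed directly from document-id sets; a minimal sketch in Python (the function name and the toy ids are illustrative, not from the slides):

```python
def precision_recall(relevant, retrieved):
    """Compute precision (Eq. 1) and recall (Eq. 2) from two document sets."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: documents identified by integer ids
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={2, 3, 5})
# p = 2/3 (two of three retrieved are relevant), r = 1/2 (two of four relevant found)
```

Note the trade-off: retrieving every document gives recall 1 but poor precision, which is why both measures are reported together.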

Text representation


Tokenization
Process of identifying keywords in a document.
Not all words in a text are relevant; text mining ignores stop words.
Stop words form the stop list.
Stop lists are context-dependent.
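As a small illustration, tokenization with a stop list can be sketched as follows (the stop list here is a tiny, made-up example; real stop lists are larger and context-dependent):

```python
import re

# A tiny illustrative stop list; in practice stop lists are context-dependent.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The binding of a protein to the promoter")
# ['binding', 'protein', 'promoter']
```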

Text representation


Vector space model
Given #d documents and #t terms, model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix
Matrix TF of size #d × #t.
Entries measure the association of a term and a document:
If a term t does not occur in a document d, then TF(d, t) = 0.
If a term t does occur in a document d, then TF(d, t) > 0.

Text representation


If term t occurs in document d, possible choices are:

TF(d, t) = 1

TF(d, t) = freq(d, t), the frequency of t in d

TF(d, t) = freq(d, t) / Σ_{t′ ∈ T} freq(d, t′)

TF(d, t) = 1 + log(1 + log(freq(d, t)))

Text representation


Inverse document frequency
Represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

IDF(t) = log((1 + |d|) / |d_t|)    (3)

where |d| is the number of all documents, and |d_t| is the number of documents containing term t.

TF-IDF measure
Product of term frequency and inverse document frequency:

TF-IDF(d, t) = TF(d, t) · IDF(t)    (4)
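Putting Eqs. (3) and (4) together with the normalised term frequency from the previous slide, a small sketch (documents as token lists; the names and toy corpus are illustrative):

```python
import math

def tf(doc, term):
    """Normalised term frequency: freq(d, t) / sum of all term frequencies in d."""
    return doc.count(term) / len(doc) if doc else 0.0

def idf(docs, term):
    """IDF(t) = log((1 + |d|) / |d_t|), Eq. (3); 0 if the term occurs nowhere."""
    dt = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / dt) if dt else 0.0

def tf_idf(docs, i, term):
    """TF-IDF(d, t) = TF(d, t) * IDF(t), Eq. (4)."""
    return tf(docs[i], term) * idf(docs, term)

docs = [["gene", "protein", "gene"], ["protein", "pathway"], ["cell"]]
score = tf_idf(docs, 0, "gene")  # (2/3) * log(4/1)
```

A term that appears in many documents gets a smaller IDF and is scaled down, as intended.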

Measuring similarity


Cosine measure
Let v1 and v2 be two document vectors. The cosine similarity is defined as

sim(v1, v2) = v1ᵀ v2 / (|v1| |v2|)    (5)
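Eq. (5) as a short sketch (plain lists as vectors; in practice a library such as NumPy would be used):

```python
import math

def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|), Eq. (5); 0 for a zero vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

s = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
# 1 / (sqrt(2) * sqrt(2)) = 0.5
```

Because the dot product is divided by both norms, documents of very different lengths can still be compared.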

Kernels
Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:
Vectorial representation: vector kernels such as the linear, polynomial, and Gaussian RBF kernels
One long string: string kernels that count common k-mers in two strings (more on that later in the course)

Keyword co-occurrence


Problem
Find sets of keywords that often co-occur.
Common problem in biomedical literature: find associations between genes, proteins or other entities using co-occurrence search.
Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

Association rules


Definitions
Let I = {I1, I2, . . . , Im} be a set of items (keywords).
Let D be the database of transactions T (the collection of documents).
A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords).
Let A be a set of items with A ⊆ T. An association rule is an implication of the form

A ⊆ T ⇒ B ⊆ T,    (6)

where A, B ⊆ I and A ∩ B = ∅.

Association rules


Support and Confidence
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:

support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}|    (7)

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}|    (8)
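Eqs. (7) and (8) in code, with transactions modelled as Python sets of keywords (the keyword database below is a made-up toy example):

```python
def support(D, A, B):
    """Fraction of transactions in D that contain A union B, Eq. (7)."""
    return sum(1 for T in D if A <= T and B <= T) / len(D)

def confidence(D, A, B):
    """Fraction of transactions containing A that also contain B, Eq. (8)."""
    has_a = [T for T in D if A <= T]
    return sum(1 for T in has_a if B <= T) / len(has_a) if has_a else 0.0

# Toy document database: each transaction is the keyword set of one document
D = [{"p53", "apoptosis"}, {"p53", "cancer"},
     {"p53", "apoptosis", "cancer"}, {"kinase"}]
s = support(D, {"p53"}, {"apoptosis"})     # 2/4 = 0.5
c = confidence(D, {"p53"}, {"apoptosis"})  # 2/3
```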

Association rules


Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!

Finding strong rules
1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets

Association rules


Apriori algorithm
Makes use of the Apriori property: if an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well. If B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

Steps
1. Determine the frequent items (k-itemsets with k = 1)
2. Join all pairs of frequent k-itemsets that differ in at most 1 item; these are the candidates Ck+1 for being frequent (k+1)-itemsets
3. Check the frequency of these candidates Ck+1: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidates are frequent
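The four steps can be sketched as follows (a deliberately simple implementation that rescans the database for every candidate; real implementations are far more efficient):

```python
from itertools import combinations

def apriori(D, minsup):
    """Return all itemsets occurring in at least a minsup fraction of D."""
    def is_frequent(itemset):
        return sum(1 for T in D if itemset <= T) / len(D) >= minsup

    # Step 1: frequent 1-itemsets
    items = sorted({i for T in D for i in T})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(level)
    k = 1
    while level:
        # Step 2: join pairs of frequent k-itemsets that differ in one item
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Apriori trick: discard candidates containing an infrequent k-itemset
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(level) for s in combinations(c, k))}
        # Step 3: keep only the candidates that are actually frequent
        level = [c for c in candidates if is_frequent(c)]
        frequent += level
        k += 1  # Step 4: repeat until no candidate survives
    return frequent

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = apriori(D, minsup=0.5)
# frequent: all three 1-itemsets and all three 2-itemsets, but not {a, b, c}
```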

Transduction


Known test set
Classification on text databases often means that we know all the data we will work with before training.
Hence the test set is known a priori.
This setting is called 'transductive'.
Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)
Trains the SVM on both the training and the test set.
Uses the test data to maximise the margin.

Inductive vs. transductive


Classification
Task: predict label y from features x

Classic inductive setting
Strategy: learn a classifier on (labelled) training data
Goal: the classifier shall generalise to unseen data from the same distribution

Transductive setting
Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: predict class labels for this particular dataset

Why transduction?


Really necessary?
The classic approach works: train on the training dataset, test on the test dataset.
That is what we usually do in practice, for instance, in cross-validation.
We usually ignore or neglect the fact that settings are transductive.

The benefits of transductive classification
Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify points from the training and test dataset identically

Why transduction?


Idea of Transductive SVMs
Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes).
Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin.
Find the hyperplane that separates the classes in the training data AND in the test data with maximum margin.

Why transduction? (figure slide)

Transduction on text (figure slide)

Transductive SVM


Linearly separable case

min_{w, b, y*}  (1/2) ‖w‖²

s.t.  y_i [wᵀ x_i + b] ≥ 1    for i = 1, . . . , n
      y*_j [wᵀ x*_j + b] ≥ 1    for j = 1, . . . , k

Transductive SVM


Non-linearly separable case

min_{w, b, y*, ξ, ξ*}  (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i + C* Σ_{j=1}^{k} ξ*_j

s.t.  y_i [wᵀ x_i + b] ≥ 1 − ξ_i    for i = 1, . . . , n
      y*_j [wᵀ x*_j + b] ≥ 1 − ξ*_j    for j = 1, . . . , k
      ξ_i ≥ 0    for i = 1, . . . , n
      ξ*_j ≥ 0    for j = 1, . . . , k

Transductive SVM


Optimisation
How to solve this OP?
Not so nice: a combination of an integer and a convex OP.
Joachims' approach: find an approximate solution by iterative application of an inductive SVM:
Train an inductive SVM on the training data, predict on the test data, and assign those labels to the test data.
Retrain on all data, with special slack weights (C*−, C*+) for the test data.
Outer loop: repeat and slowly increase (C*−, C*+).
Inner loop: within each repetition, repeatedly switch pairs of 'misclassified' test points.
This is a local search with an approximate solution to the OP.
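A heavily simplified sketch of the iterative idea: a self-training loop around an inductive linear SVM. The pair-switching inner loop and the slowly increasing (C*−, C*+) schedule that make Joachims' algorithm work well are omitted here, and the tiny subgradient trainer stands in for a proper QP solver; all names and data are illustrative:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Tiny subgradient-descent linear SVM; a stand-in for a real solver."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss active: step towards this point
                w = [wj * (1 - lr / C) + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # otherwise only the regulariser shrinks w
                w = [wj * (1 - lr / C) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def tsvm_self_train(X_train, y_train, X_test, rounds=5):
    """Label the test set with the current model, retrain on the union, repeat."""
    w, b = train_linear_svm(X_train, y_train)
    for _ in range(rounds):
        y_star = [predict(w, b, x) for x in X_test]
        w, b = train_linear_svm(X_train + X_test, y_train + y_star)
    return w, b

# Toy linearly separable data (illustrative values)
X_train = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y_train = [1, 1, -1, -1]
X_test = [[2.5, 0.5], [-2.5, -0.5]]
w, b = tsvm_self_train(X_train, y_train, X_test)
```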

Inductive SVM for TSVM


Variant of inductive SVM

min_{w, b, ξ, ξ*}  (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i + C*− Σ_{j: y*_j = −1} ξ*_j + C*+ Σ_{j: y*_j = +1} ξ*_j

s.t.  y_i [wᵀ x_i + b] ≥ 1 − ξ_i    for i = 1, . . . , n
      y*_j [wᵀ x*_j + b] ≥ 1 − ξ*_j    for j = 1, . . . , k

Three different penalty costs
C for points from the training dataset
C*− for points from the test dataset currently in class −1
C*+ for points from the test dataset currently in class +1

Experiments


Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test set size of 3,299

Experiments


Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

Experiments


Average P/R-breakeven point on the WebKB category 'course' for different training set sizes

Experiments


Average P/R-breakeven point on the WebKB category 'project' for different training set sizes

Summary


Results
Transductive version of the SVM.
Maximises the margin on training and test data.
The implementation uses a variant of the classic inductive SVM.
The solution is approximate and fast.
Works well on text, in particular on small training samples and large test sets.

References and further reading


References

[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.

[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier, Morgan Kaufmann Publishers, 2006.

The end


See you tomorrow! Next topic: Graph Mining
