
.: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics
Day 4: Text Mining

Karsten Borgwardt

February 21 to March 4, 2011

Machine Learning & Computational Biology Research Group, MPIs Tübingen

What is text mining?


Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation
Most knowledge is stored in terms of texts, both in industry and in academia.
This alone makes text mining an integral part of knowledge discovery!
Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

What is text mining?


Common tasks
Information retrieval: find documents that are relevant to a user, or to a query, in a collection of documents
Document ranking: rank all documents in the collection
Document selection: classify documents into relevant and irrelevant
Information filtering: search newly created documents for information that is relevant to a user
Document classification: assign a document to a category that describes its content
Keyword co-occurrence: find groups of keywords that co-occur in many documents

Evaluating text mining


Precision and Recall
Let the set of documents relevant to a query be denoted {Relevant} and the set of retrieved documents {Retrieved}.
The precision is the proportion of retrieved documents that are relevant to the query:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|    (1)

The recall is the proportion of relevant documents that were retrieved by the query:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|    (2)
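The two measures can be computed directly from document-id sets; a minimal sketch in Python (the function name and the toy ids are illustrative, not from the slides):

```python
def precision_recall(relevant, retrieved):
    """Compute precision (Eq. 1) and recall (Eq. 2) from two document sets."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: documents identified by integer ids
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={2, 3, 5})
# p = 2/3 (two of three retrieved are relevant), r = 1/2 (two of four relevant found)
```

Note the trade-off: retrieving every document gives recall 1 but poor precision, which is why both measures are reported together.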

Text representation


Tokenization
Process of identifying keywords in a document.
Not all words in a text are relevant; text mining ignores stop words.
Stop words form the stop list.
Stop lists are context-dependent.
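As a small illustration, tokenization with a stop list can be sketched as follows (the stop list here is a tiny, made-up example; real stop lists are larger and context-dependent):

```python
import re

# A tiny illustrative stop list; in practice stop lists are context-dependent.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The binding of a protein to the promoter")
# ['binding', 'protein', 'promoter']
```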

Text representation


Vector space model
Given #d documents and #t terms, model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix
Matrix TF of size #d × #t.
Entries measure the association of a term and a document:
If a term t does not occur in a document d, then TF(d, t) = 0.
If a term t does occur in a document d, then TF(d, t) > 0.

Text representation


If term t occurs in document d, possible choices are:

TF(d, t) = 1

TF(d, t) = freq(d, t), the frequency of t in d

TF(d, t) = freq(d, t) / Σ_{t′ ∈ T} freq(d, t′)

TF(d, t) = 1 + log(1 + log(freq(d, t)))

Text representation


Inverse document frequency
Represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

IDF(t) = log((1 + |d|) / |d_t|)    (3)

where |d| is the number of all documents, and |d_t| is the number of documents containing term t.

TF-IDF measure
Product of term frequency and inverse document frequency:

TF-IDF(d, t) = TF(d, t) · IDF(t)    (4)
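Putting Eqs. (3) and (4) together with the normalised term frequency from the previous slide, a small sketch (documents as token lists; the names and toy corpus are illustrative):

```python
import math

def tf(doc, term):
    """Normalised term frequency: freq(d, t) / sum of all term frequencies in d."""
    return doc.count(term) / len(doc) if doc else 0.0

def idf(docs, term):
    """IDF(t) = log((1 + |d|) / |d_t|), Eq. (3); 0 if the term occurs nowhere."""
    dt = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / dt) if dt else 0.0

def tf_idf(docs, i, term):
    """TF-IDF(d, t) = TF(d, t) * IDF(t), Eq. (4)."""
    return tf(docs[i], term) * idf(docs, term)

docs = [["gene", "protein", "gene"], ["protein", "pathway"], ["cell"]]
score = tf_idf(docs, 0, "gene")  # (2/3) * log(4/1)
```

A term that appears in many documents gets a smaller IDF and is scaled down, as intended.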

Measuring similarity


Cosine measure
Let v1 and v2 be two document vectors. The cosine similarity is defined as

sim(v1, v2) = v1ᵀ v2 / (|v1| |v2|)    (5)
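Eq. (5) as a short sketch (plain lists as vectors; in practice a library such as NumPy would be used):

```python
import math

def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|), Eq. (5); 0 for a zero vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

s = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
# 1 / (sqrt(2) * sqrt(2)) = 0.5
```

Because the dot product is divided by both norms, documents of very different lengths can still be compared.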

Kernels
Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:
Vectorial representation: vector kernels such as the linear, polynomial, and Gaussian RBF kernels
One long string: string kernels that count common k-mers in two strings (more on that later in the course)

Keyword co-occurrence


Problem
Find sets of keywords that often co-occur.
Common problem in biomedical literature: find associations between genes, proteins or other entities using co-occurrence search.
Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

Association rules


Definitions
Let I = {I1, I2, . . . , Im} be a set of items (keywords).
Let D be the database of transactions T (the collection of documents).
A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords).
Let A be a set of items with A ⊆ T. An association rule is an implication of the form

A ⊆ T ⇒ B ⊆ T,    (6)

where A, B ⊆ I and A ∩ B = ∅.

Association rules


Support and Confidence
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:

support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}|    (7)

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}|    (8)
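Eqs. (7) and (8) in code, with transactions modelled as Python sets of keywords (the keyword database below is a made-up toy example):

```python
def support(D, A, B):
    """Fraction of transactions in D that contain A union B, Eq. (7)."""
    return sum(1 for T in D if A <= T and B <= T) / len(D)

def confidence(D, A, B):
    """Fraction of transactions containing A that also contain B, Eq. (8)."""
    has_a = [T for T in D if A <= T]
    return sum(1 for T in has_a if B <= T) / len(has_a) if has_a else 0.0

# Toy document database: each transaction is the keyword set of one document
D = [{"p53", "apoptosis"}, {"p53", "cancer"},
     {"p53", "apoptosis", "cancer"}, {"kinase"}]
s = support(D, {"p53"}, {"apoptosis"})     # 2/4 = 0.5
c = confidence(D, {"p53"}, {"apoptosis"})  # 2/3
```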

Association rules


Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!

Finding strong rules
1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets

Association rules


Apriori algorithm
Makes use of the Apriori property: if an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well. If B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

Steps
1. Determine the frequent items (k-itemsets with k = 1)
2. Join all pairs of frequent k-itemsets that differ in at most 1 item; these are the candidates Ck+1 for being frequent (k+1)-itemsets
3. Check the frequency of these candidates Ck+1: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidates are frequent
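The four steps can be sketched as follows (a deliberately simple implementation that rescans the database for every candidate; real implementations are far more efficient):

```python
from itertools import combinations

def apriori(D, minsup):
    """Return all itemsets occurring in at least a minsup fraction of D."""
    def is_frequent(itemset):
        return sum(1 for T in D if itemset <= T) / len(D) >= minsup

    # Step 1: frequent 1-itemsets
    items = sorted({i for T in D for i in T})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = list(level)
    k = 1
    while level:
        # Step 2: join pairs of frequent k-itemsets that differ in one item
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Apriori trick: discard candidates containing an infrequent k-itemset
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(level) for s in combinations(c, k))}
        # Step 3: keep only the candidates that are actually frequent
        level = [c for c in candidates if is_frequent(c)]
        frequent += level
        k += 1  # Step 4: repeat until no candidate survives
    return frequent

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = apriori(D, minsup=0.5)
# frequent: all three 1-itemsets and all three 2-itemsets, but not {a, b, c}
```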

Transduction


Known test set
Classification on text databases often means that we know all the data we will work with before training.
Hence the test set is known a priori.
This setting is called 'transductive'.
Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)
Trains the SVM on both the training and the test set.
Uses the test data to maximise the margin.

Inductive vs. transductive


Classification
Task: predict label y from features x

Classic inductive setting
Strategy: learn a classifier on (labelled) training data
Goal: the classifier shall generalise to unseen data from the same distribution

Transductive setting
Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: predict class labels for this particular dataset

Why transduction?


Really necessary?
The classic approach works: train on the training dataset, test on the test dataset.
That is what we usually do in practice, for instance, in cross-validation.
We usually ignore or neglect the fact that settings are transductive.

The benefits of transductive classification
Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify points from the training and test dataset identically

Why transduction?


Idea of Transductive SVMs
Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes).
Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin.
Find the hyperplane that separates the classes in the training data AND in the test data with maximum margin.

Why transduction? (figure slide)

Transduction on text (figure slide)

Transductive SVM


Linearly separable case

min_{w, b, y*}  (1/2) ‖w‖²

s.t.  y_i [wᵀ x_i + b] ≥ 1    for i = 1, . . . , n
      y*_j [wᵀ x*_j + b] ≥ 1    for j = 1, . . . , k

Transductive SVM


Non-linearly separable case

min_{w, b, y*, ξ, ξ*}  (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i + C* Σ_{j=1}^{k} ξ*_j

s.t.  y_i [wᵀ x_i + b] ≥ 1 − ξ_i    for i = 1, . . . , n
      y*_j [wᵀ x*_j + b] ≥ 1 − ξ*_j    for j = 1, . . . , k
      ξ_i ≥ 0    for i = 1, . . . , n
      ξ*_j ≥ 0    for j = 1, . . . , k

Transductive SVM


Optimisation
How to solve this OP?
Not so nice: a combination of an integer and a convex OP.
Joachims' approach: find an approximate solution by iterative application of an inductive SVM:
Train an inductive SVM on the training data, predict on the test data, and assign those labels to the test data.
Retrain on all data, with special slack weights (C*−, C*+) for the test data.
Outer loop: repeat and slowly increase (C*−, C*+).
Inner loop: within each repetition, repeatedly switch pairs of 'misclassified' test points.
This is a local search with an approximate solution to the OP.
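A heavily simplified sketch of the iterative idea: a self-training loop around an inductive linear SVM. The pair-switching inner loop and the slowly increasing (C*−, C*+) schedule that make Joachims' algorithm work well are omitted here, and the tiny subgradient trainer stands in for a proper QP solver; all names and data are illustrative:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Tiny subgradient-descent linear SVM; a stand-in for a real solver."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss active: step towards this point
                w = [wj * (1 - lr / C) + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # otherwise only the regulariser shrinks w
                w = [wj * (1 - lr / C) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def tsvm_self_train(X_train, y_train, X_test, rounds=5):
    """Label the test set with the current model, retrain on the union, repeat."""
    w, b = train_linear_svm(X_train, y_train)
    for _ in range(rounds):
        y_star = [predict(w, b, x) for x in X_test]
        w, b = train_linear_svm(X_train + X_test, y_train + y_star)
    return w, b

# Toy linearly separable data (illustrative values)
X_train = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y_train = [1, 1, -1, -1]
X_test = [[2.5, 0.5], [-2.5, -0.5]]
w, b = tsvm_self_train(X_train, y_train, X_test)
```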

Inductive SVM for TSVM


Variant of inductive SVM

min_{w, b, ξ, ξ*}  (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i + C*− Σ_{j: y*_j = −1} ξ*_j + C*+ Σ_{j: y*_j = +1} ξ*_j

s.t.  y_i [wᵀ x_i + b] ≥ 1 − ξ_i    for i = 1, . . . , n
      y*_j [wᵀ x*_j + b] ≥ 1 − ξ*_j    for j = 1, . . . , k

Three different penalty costs
C for points from the training dataset
C*− for points from the test dataset currently in class −1
C*+ for points from the test dataset currently in class +1

Experiments


Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test set size of 3,299

Experiments


Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

Experiments


Average P/R-breakeven point on the WebKB category 'course' for different training set sizes

Experiments


Average P/R-breakeven point on the WebKB category 'project' for different training set sizes

Summary


Results
Transductive version of the SVM.
Maximises the margin on training and test data.
The implementation uses a variant of the classic inductive SVM.
The solution is approximate and fast.
Works well on text, in particular on small training samples and large test sets.

References and further reading


References

[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.

[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier, Morgan Kaufmann Publishers, 2006.

The end


See you tomorrow! Next topic: Graph Mining
