Applied Machine Learning
Annalisa Marsico, OWL RNA Bioinformatics group, Max Planck Institute for Molecular Genetics / Free University of Berlin
27 May, SoSe 2015



Page 1

Applied Machine Learning

Annalisa Marsico, OWL RNA Bioinformatics group

Max Planck Institute for Molecular Genetics / Free University of Berlin

27 May, SoSe 2015

Page 2

Supervised vs Unsupervised Learning

Typical scenario: we have an outcome, either quantitative (price of a stock, a risk factor, ...) or categorical (heart attack: yes or no), that we want to predict based on some features. We have a training set of data and build a prediction model, a learner, able to predict the outcome of new, unseen objects.

- Supervised learning: the presence of the outcome variable guides the learning process.
- Unsupervised learning: we have only features, no outcome; the task is rather to describe the structure of the data.

Page 3

Supervised Learning

• Find the connection between two sets of observations: the input set and the output set

• Given $\{(\boldsymbol{x}_n, y_n)\}$, where $\boldsymbol{x}_n \in P$ (the feature space), find a function $f$ (from a set of hypotheses $H$) such that $\forall n \in [1, \ldots, N]$:

$y_n = f(\boldsymbol{x}_n)$

$f$ can be a linear function, a polynomial function, a classifier (e.g. logistic regression), a neural network, an SVM, a random forest, ...
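
To make the setting concrete, here is a minimal sketch (my own illustration, not from the lecture; scikit-learn and the toy data are assumptions) of learning such an $f$ from labeled pairs and applying it to new, unseen objects:

```python
# Minimal supervised-learning sketch (assumes scikit-learn; the toy data and
# the choice of logistic regression as f are illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                        # features x_n, P = 2
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # categorical outcome y_n

f = LogisticRegression().fit(X_train, y_train)  # pick f from the hypothesis set H
X_new = rng.normal(size=(5, 2))                 # new, unseen objects
print(f.predict(X_new))                         # predicted outcomes f(x)
```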

Page 4

Unsupervised learning

• Find structure in the data

• Given $X = \{x_n\}$ measurements / observations / features, find a model $M$ such that $p(M|X)$ is maximized, i.e. find the process that is most likely to have generated the data.

Page 5

Gaussian Mixture Models

$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Page 6

Gaussian Mixture models

• Choose a number of clusters $K$
• Initialize the $K$ priors $\pi_k$, the $K$ means $\mu_k$, and the covariances $\Sigma_k$
• Repeat until convergence (the EM algorithm; see the sketch below):
– E-step: compute the probability $p(k|\boldsymbol{x}_n)$ of each data point $\boldsymbol{x}_n$ belonging to cluster $k$
– M-step: update the parameters of the model (cluster priors $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$) by taking the weighted average number / location / variance over all points, where the weight of point $\boldsymbol{x}_n$ is $p(k|\boldsymbol{x}_n)$
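
A compact sketch of this EM loop for a Gaussian mixture (an illustration in numpy/scipy, assuming those libraries; the random initialization and the fixed iteration count instead of a convergence test are simplifying choices, not part of the lecture):

```python
# EM for a Gaussian mixture: a minimal numpy/scipy sketch.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                      # cluster priors pi_k
    mu = X[rng.choice(n, size=K, replace=False)]  # means mu_k: random data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: p(k | x_n), the probability of each point belonging to cluster k
        resp = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k]) for k in range(K)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted-average updates of pi_k, mu_k, Sigma_k,
        # where the weight of point x_n is p(k | x_n)
        Nk = resp.sum(axis=0)
        pi = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma, resp
```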

Page 7

Gaussian Mixture models

Page 8

Semi-supervised learning

• A middle ground between supervised and unsupervised learning

• We are in the following situation:
– A set of instances $X = \{x_n\}$ drawn from some unknown probability distribution $p(X)$
– We wish to learn a target function $f: X \rightarrow Y$ (or a learner), given a set $H$ of possible hypotheses and given:
$L = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ labeled examples
$U = \{x_{l+1}, \ldots, x_n\}$ unlabeled examples
Usually $l \ll n$

• We wish to find the hypothesis (function) with the lowest error:
$\hat{f} = \operatorname{argmin}_{h \in H} E(h)$, where $E(h) \equiv P[h(x) \neq f(x)]$

Page 9

Why the name?

Supervised learning
• Classification, regression: $\{(x_{1:n}, y_{1:n})\}$

Semi-supervised classification / regression: $\{(x_{1:l}, y_{1:l}), x_{l+1:n}\}$

Semi-supervised clustering: $\{(x_{1:l}, y_{1:l}), x_{l+1:n}\}$

Unsupervised learning
• Clustering, mixture models: $\{x_{1:n}\}$

Page 10

Why bother?

• There are cases where labeled examples are really few
• People want better performance 'for free'
• Unlabeled data are cheap; labeled data might be hard to get:
– Drug design, i.e. the effect of a drug on protein activity
– Genomics: real binding sites vs. non-real binding sites
– SNP data for disease classification

Page 11

Hard-to-get labels

Image categorization of "eclipse": one could in principle label 1000+ images manually

Page 12

Hard-to-get labels

Nonetheless, the number of images to classify might be huge.

We will show how to improve classification by using the unlabeled examples

Page 13

Semi-supervised mixture models

• First approach: modify the posterior probability $p(k|\boldsymbol{x}_n)$ of each labeled data point in the E-step (set $p(k|\boldsymbol{x}_n) = 1$ if the point is labeled as positive, 0 otherwise). Example on the blackboard (miRNA promoters). A sketch of this clamping step follows below.

• Second approach: train a classifier on the labeled examples and estimate the parameters (M-step); then compute the 'expected' class for the unlabeled examples (E-step). Example on the blackboard.
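
A minimal sketch of the clamping idea behind the first approach (my illustration, building on the EM sketch above; the function name and the -1 convention for unlabeled points are assumptions):

```python
# Semi-supervised E-step sketch: clamp the posterior p(k | x_n) for labeled
# points instead of computing it; the M-step then proceeds unchanged.
# `resp` is the (n, K) responsibility matrix from the usual E-step;
# `labels[i]` is the known cluster index of point i, or -1 if unlabeled.
import numpy as np

def clamp_responsibilities(resp, labels):
    resp = resp.copy()
    for i, k in enumerate(labels):
        if k >= 0:              # labeled point: its posterior is known
            resp[i, :] = 0.0
            resp[i, k] = 1.0    # p(k | x_i) = 1 for the given label, 0 otherwise
    return resp
```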

Page 14

How can unlabeled data ever help?

This is only one of the many ways to use unlabeled data

Page 15

Why the name?

Supervised learning
• Classification, regression: $\{(x_{1:n}, y_{1:n})\}$

Semi-supervised classification / regression: $\{(x_{1:l}, y_{1:l}), x_{l+1:n}\}$

Semi-supervised clustering: $\{(x_{1:l}, y_{1:l}), x_{l+1:n}\}$

Unsupervised learning
• Clustering, mixture models: $\{x_{1:n}\}$

Page 16

Semi-supervised regression / classification

Goal: use both labeled and unlabeled data to build better learners than using either alone.

– Boost the performance of a learning algorithm when only a small set of labeled data is available

Page 17

Example 1 – web page classification

We want a system which automatically classifies web pages into faculty (academic) web pages vs other pages

– Labeled examples: pages labeled by hand
– Unlabeled examples: millions of pages on the web
– Nice feature: web pages have multiple representations (they can be described by different kinds of information)

Page 18

Example 1: Redundantly sufficient features

[Figure: example web page; the hyperlink text "my advisor" points to Dr. Bernhard Renard's page]

Setting: redundant features, i.e. the description of each example can be partitioned into distinct views

Page 19

Example: Redundantly sufficient features

[Figure: example web page; the hyperlink text "my advisor" points to Dr. Bernhard Renard's page]

Page 20

Example: Redundantly sufficient features

Page 21

Co-training, multi-view learning, Co-regularization – Main idea

In some settings the data features are redundant:
– We can train different classifiers on disjoint feature sets
– The classifiers should agree on the unlabeled examples
– We can use the unlabeled data to constrain the joint training of both classifiers
– Different algorithms differ in the way they constrain the classifiers

Page 22

The Co-training algorithm – Variant 1

Assumptions:
– Either view of the features is sufficient for the learning task
– Compatibility assumption (a strong one): the classifiers in each view agree on the labels of most unlabeled examples
– Independence assumption: the views are independent given the class labels (conditional independence)

Page 23

The Co-training algorithm – Variant 1

Given:
• Labeled data L
• Unlabeled data U

Loop:
• Train g1 (hyperlink classifier) using L
• Train g2 (page classifier) using L
• Sample N1 points from U; let g1 label p1 positives and n1 negatives
• Sample N2 points from U; let g2 label p2 positives and n2 negatives
• Add the (N1 + N2) self-labeled examples to L

A. Blum & T. Mitchell, 1998, 'Combining Labeled and Unlabeled Data with Co-Training'

Page 24

The Co-training algorithm – Variant 2

Given:
• Labeled data L
• Unlabeled data U

Loop:
• Train g1 (hyperlink classifier) using L
• Train g2 (page classifier) using L
• Sample N1 points from U; let g1 label p1 positives and n1 negatives
• Sample N2 points from U; let g2 label p2 positives and n2 negatives
• Add the self-labeled n1 < N1 examples where g1 is most confident, and the n2 < N2 examples where g2 is most confident, to L (see the sketch below)

A. Blum & T. Mitchell, 1998, 'Combining Labeled and Unlabeled Data with Co-Training'
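
A hedged sketch of the co-training loop in the spirit of Variant 2 (an illustration only, not Blum & Mitchell's code; the choice of naive Bayes, the pool sizes, and the simplified handling of conflicting picks are all assumptions):

```python
# Co-training sketch (Variant 2): in each round, each view's classifier
# self-labels the unlabeled points it is most confident about, and those
# points are moved from the unlabeled pool U to the labeled pool L.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, U1, U2, n_rounds=10, n_pos=1, n_neg=3):
    """X1/X2: two views of the labeled data; y: 0/1 labels (both classes present).
    U1/U2: the two views of the unlabeled pool."""
    g1, g2 = GaussianNB(), GaussianNB()
    for _ in range(n_rounds):
        g1.fit(X1, y)
        g2.fit(X2, y)
        if len(U1) < n_pos + n_neg:
            break
        new_idx, new_lab = [], []
        for g, U in ((g1, U1), (g2, U2)):
            proba = g.predict_proba(U)[:, 1]
            for i in np.argsort(proba)[-n_pos:]:   # most confident positives
                new_idx.append(i); new_lab.append(1)
            for i in np.argsort(proba)[:n_neg]:    # most confident negatives
                new_idx.append(i); new_lab.append(0)
        # Conflicting picks by g1 and g2 are simply both added in this sketch.
        idx = np.array(new_idx)
        X1 = np.vstack([X1, U1[idx]])
        X2 = np.vstack([X2, U2[idx]])
        y = np.concatenate([y, np.array(new_lab)])
        keep = np.setdiff1d(np.arange(len(U1)), idx)
        U1, U2 = U1[keep], U2[keep]
    return g1.fit(X1, y), g2.fit(X2, y)
```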

Page 25

The Co-training algorithm – Results

Web-page classification task:
– Experiment: 1051 web pages
– 263 (25%) left out (randomly) for testing
– 12 of the remaining 788 pages are labeled (L)
– 5 experiments conducted using different test + training splits

A. Blum & T. Mitchell, 1998

Plot of the error on the test set vs. the number of iterations: thanks to co-training, the errors converge after a while.

Page 26

The Co-training algorithm – Intuition and limitations

• If the hyperlink classifier finds a page it is highly confident about, it can share it with the other classifier (the page classifier). This is how they benefit from each other.

• Starting from a set of labeled examples, the classifiers co-train each other at each iteration (this can be seen as a 'greedy' EM).

• Limitation: even though each classifier picks the examples it is most confident about, it ignores the other view.

Page 27

Multi-view learning (Co-regularization) – Motivation

• It relaxes the assumption of compatibility

– If classifiers don’t agree on unlabeled examples, then we have noisy data.

– What is a way to reduce noise (variance) in the data?

Page 28

Multi-view learning (Co-regularization)

A framework where classifiers are learnt in each view through forms of multi-view regularization.

– We are in the case where $g_1(x) \neq g_2(x)$
– Joint regularization to minimize the disagreement between them:

$$\langle \hat{\theta}_1, \hat{\theta}_2 \rangle \leftarrow \operatorname{argmin}_{(\theta_1, \theta_2)} \sum_{i \in L} \big(y_i - g_1(x_i; \theta_1)\big)^2 + \sum_{i \in L} \big(y_i - g_2(x_i; \theta_2)\big)^2 + \lambda \sum_{i \in U} \big(g_1(x_i; \theta_1) - g_2(x_i; \theta_2)\big)^2$$

The last sum is the penalization term; a coefficient $\lambda$ controls its weight.
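
For intuition, a sketch of minimizing this joint objective by gradient descent when $g_1$ and $g_2$ are linear (my illustration; the linear form, the learning rate, and the value of $\lambda$ are assumptions):

```python
# Co-regularization sketch: joint squared loss on the labeled points in each
# view plus a disagreement penalty on the unlabeled points, minimized by
# plain gradient descent for linear predictors g_v(x) = x @ theta_v.
import numpy as np

def co_regularize(X1_l, X2_l, y, X1_u, X2_u, lam=1.0, lr=1e-3, n_iter=5000):
    # lam controls how strongly disagreement is penalized, i.e. how much
    # compatibility between the two views is enforced.
    theta1 = np.zeros(X1_l.shape[1])
    theta2 = np.zeros(X2_l.shape[1])
    for _ in range(n_iter):
        r1 = X1_l @ theta1 - y             # labeled residuals, view 1
        r2 = X2_l @ theta2 - y             # labeled residuals, view 2
        d = X1_u @ theta1 - X2_u @ theta2  # disagreement on unlabeled points
        theta1 -= lr * 2 * (X1_l.T @ r1 + lam * X1_u.T @ d)
        theta2 -= lr * 2 * (X2_l.T @ r2 - lam * X2_u.T @ d)
    return theta1, theta2
```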

Page 29

Co-regularization – Intuition and Limitations

• Builds on the idea that instead of making $g_1$ and $g_2$ agree on examples afterwards, the agreement is built into the objective function we are optimizing.

• Unlabeled data are incorporated into the regularization.

• The algorithms are not greedy.

• We can decide how much compatibility to enforce (i.e. how much to penalize disagreement). Any idea how?

• Limitation: no good implementation so far.

Page 30

Essential ingredients for Co-training / Co-regularization – Summary

• A large number of unlabeled examples
• Multiple views (particularly suitable for biology)
• Conditional independence
• A good way to solve your optimization problem when you have joint regularization

Page 31

Semi-supervised learning in Biology

• Regulatory element prediction (e.g. promoters)
• Protein function prediction
• Diseases that are hard to diagnose (few labels)
• Long non-coding RNA function prediction
• miRNA target prediction ...