

Lecture 10: Kernel Methods for Structured Outputs

Pavel Laskov, Blaine Nelson

Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany

Advanced Topics in Machine Learning, 2012


What are Structured Outputs?

Multiclass classification

Parsing

[Figure: example parse tree]

Syntactic alignment

[Figure: example sentence alignment]

Label sequence learning


Multi-Class Classification: One-vs-the-Rest

Advantages:

Small number of classifiers

Rejection possible

Disadvantages:

Complex classifier boundaries

Unbalanced classification problems


Pairwise Multi-Class Classification

Disadvantages:

Large number of classifiers

Rejection not possible

Advantages:

Simple classifier boundaries

Balanced classification problems


Multi-Class Classification: Distributed Encoding

Encode classes with binary code words with sufficient separation between them.

Train binary classifiers corresponding to the columns of the code matrix.

Apply all classifiers at the decision stage and decide by the code word closest in Hamming distance.

If the minimal Hamming distance between code words is d, then ⌊(d − 1)/2⌋ errors can be corrected.

          code word
class   1   2   3   4   5
  1     0   0   0   1   1
  2     0   0   1   0   1
  3     0   1   0   0   1
  4     0   1   1   0   0
  5     1   0   0   1   0
  6     1   1   0   1   1
  7     1   1   1   0   1
  8     1   1   1   1   0
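As an illustration of this decoding rule, a minimal sketch of nearest-code-word (Hamming) decoding; the code matrix mirrors the table above and the function name is illustrative, not from the lecture.

```python
import numpy as np

# Code matrix from the table above: one row per class, one column per binary classifier.
CODES = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])

def ecoc_decode(bits, codes=CODES):
    # bits: the binary predictions of the column classifiers for one input.
    # Pick the class whose code word is closest in Hamming distance (ties broken by argmin).
    distances = np.sum(codes != np.asarray(bits), axis=1)
    return int(np.argmin(distances)) + 1   # classes are numbered 1..8 in the table
```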

Advantages:

Small number of classifiers

Balanced classification problems

Disadvantages:

Complex classifier boundaries

Rejection not possible


Multi-Class SVM

Consider the decision function of a one-vs-the-rest classifier:

f(x) = \arg\max_{m \in \{1,\dots,M\}} f_m(x)

Key idea: build this constraint into the training problem for each example with known label:

\min_{w}\ \frac{1}{2} \sum_{m=1}^{M} (w_m \cdot w_m)

\text{subject to: } (w_{y_i} \cdot x_i) + b_{y_i} \ \ge\ \max_{m \ne y_i} \big[ (w_m \cdot x_i) + b_m \big], \quad \forall i
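Read concretely, the decision rule above just takes the best-scoring of the M linear classifiers; a minimal NumPy sketch (array names are illustrative):

```python
import numpy as np

def predict_multiclass(W, b, x):
    # W: (M, d) stacked weight vectors w_m, b: (M,) biases, x: (d,) input
    # f(x) = argmax_m f_m(x) = argmax_m (w_m . x + b_m)
    return int(np.argmax(W @ x + b))
```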


Multi-Class SVM (ctd.)

The training problem can be adjusted to “soft-margin” and transformed into the following form:

\min_{w,\,\xi}\ \frac{1}{2} \sum_{m=1}^{M} (w_m \cdot w_m) \;+\; C \sum_{i=1}^{N} \sum_{m \ne y_i} \xi_{im}

\text{subject to: } (w_{y_i} \cdot x_i) + b_{y_i} \ \ge\ (w_m \cdot x_i) + b_m + 2 - \xi_{im}, \quad \forall i,\ m \ne y_i

\xi_{im} \ge 0, \quad \forall i,\ m \ne y_i

Single quadratic problem with 2N(M − 1) linear constraints.
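A minimal sketch of this soft-margin training problem using cvxpy as a generic QP solver. This is purely illustrative: the function name and solver choice are my own, and in practice such problems are solved via the dual, as on the next slide.

```python
import numpy as np
import cvxpy as cp

def train_multiclass_svm(X, y, M, C=1.0):
    # X: (N, d) inputs, y: (N,) labels in {0, ..., M-1}
    N, d = X.shape
    W = cp.Variable((M, d))
    b = cp.Variable(M)
    xi = cp.Variable((N, M), nonneg=True)   # slacks; entries with m == y_i are unused and stay at 0
    constraints = []
    for i in range(N):
        for m in range(M):
            if m == y[i]:
                continue
            constraints.append(
                W[y[i]] @ X[i] + b[y[i]] >= W[m] @ X[i] + b[m] + 2 - xi[i, m]
            )
    objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return W.value, b.value
```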


Multi-Class SVM (ctd.)

A dual formulation of the multi-class SVM can be obtained by the standard technique of Lagrange multipliers:

\max_{\alpha}\ 2 \sum_{i,m} \alpha_{im} \;-\; \sum_{i,j,m} \Big[ \tfrac{1}{2}\, 1_{y_i = y_j}\, A_i A_j \;-\; \alpha_{im}\alpha_{j y_i} \;+\; \tfrac{1}{2}\, \alpha_{im}\alpha_{jm} \Big]\, (x_i \cdot x_j)

\text{subject to: } \sum_{i=1}^{N} \alpha_{im} = \sum_{i=1}^{N} 1_{y_i = m}\, A_i, \quad m = 1,\dots,M

0 \le \alpha_{im} \le C, \qquad \alpha_{i y_i} = 0

where A_i = \sum_{m=1}^{M} \alpha_{im}.


Multi-Class SVM: Summary

Generalized notion of the margin: the difference in classification score between the correct class and the closest competing class.

A single optimization problem accounts for interaction between various classes.

Kernelization is straightforward.

However, the size of the optimization problem is multiplied by the number of classes.



Structured Output Learning: Preliminaries

Output values y belong to a discrete space Y (e.g., parse trees).

Goal: given the input {(x1, y1), . . . , (xN, yN)}, find a function f : X → Y which represents the dependence of y on x.

How can we define a function with the range in Y?

Detour: suppose we could learn some other function F : X × Y → R. Then we can define f as:

f(x) = \arg\max_{y \in Y} F(x, y)
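A trivial sketch of this detour for the case where the candidate outputs can be enumerated (the names F and candidates are placeholders; for real structured spaces the maximization needs a problem-specific algorithm, discussed later):

```python
def predict(x, candidates, F):
    # f(x) = argmax_{y in Y} F(x, y), taken over an explicit candidate set
    return max(candidates, key=lambda y: F(x, y))
```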



Structured Output Learning: Problem Setup

Assume F to be linear in some joint space of inputs and outputs:

F(x, y) = \langle w, \Psi(x, y) \rangle

For each example, define its margin as the prediction difference to some “runner-up” example:

M_i = \langle w, \Psi(x_i, y_i) \rangle \;-\; \max_{y \in Y \setminus \{y_i\}} \langle w, \Psi(x_i, y) \rangle

Define the learning problem as weight minimization subject to margin constraints:

\min_{w}\ \frac{1}{2}\|w\|^2 \quad \text{subject to: } M_i \ge 1, \ \forall i
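A direct reading of this margin as code, assuming a joint feature map psi and an enumerable candidate set (illustrative names):

```python
import numpy as np

def structured_margin(w, psi, x_i, y_i, candidates):
    # M_i = <w, Psi(x_i, y_i)> - max over y != y_i of <w, Psi(x_i, y)>
    true_score = w @ psi(x_i, y_i)
    runner_up = max(w @ psi(x_i, y) for y in candidates if y != y_i)
    return true_score - runner_up
```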


Re-inventing the Wheel: Multi-Class SVM

The single-problem multi-class SVM is a special case of structured output learning with the following definitions:

w = [w_1, \dots, w_M]

\Psi(x, y) = [\,1_{y=1}\,x,\ \dots,\ 1_{y=M}\,x\,]

\|w\|^2 = \sum_{m=1}^{M} (w_m \cdot w_m)

M_i = \langle w_{y_i}, x_i \rangle \;-\; \max_{m \ne y_i} \langle w_m, x_i \rangle
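A sketch of this joint feature map (labels assumed to be 0, ..., M−1 for indexing; the helper name is mine):

```python
import numpy as np

def joint_feature_map(x, y, M):
    # Psi(x, y): place x in the y-th of M blocks, zeros elsewhere.
    d = x.shape[0]
    psi = np.zeros(M * d)
    psi[y * d:(y + 1) * d] = x
    return psi

# With w = [w_1, ..., w_M] stacked the same way, <w, Psi(x, y)> equals <w_y, x>.
```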


Non-trivial Example: Learning to Parse

For strongly structured problems, fixing point-wise margins to 1 (i.e., requiring M_i \ge 1) may be too rigid.

Solution: introduce a loss function \Delta(y, y_i) and change the point-wise margins to:

M_i \ \ge\ 1 - \frac{\xi_i}{\Delta(y, y_i)}


Algorithmics: The Devil Is in the Details

How can we solve optimization problems in the form

\min_{w}\ \frac{1}{2}\|w\|^2

\text{subject to: } \langle w, \Psi(x_i, y_i) \rangle \;-\; \max_{y \in Y \setminus \{y_i\}} \langle w, \Psi(x_i, y) \rangle \ \ge\ 1, \quad i = 1, \dots, N\ ?

Algorithmic challenges:

Constraints are rather complex (convex but non-differentiable).

Previous solution – replacing each constraint by |Y| ones – does not work for infinite Y.


Sketch of the Algorithm

Main idea: de-couple optimization from finding the max in each of the N constraints.

Algorithm 1: Structured Output SVM

input: {(x1, y1), . . . , (xN, yN)}, Ψ, ε
S ← ∅
repeat
    for i = 1, . . . , N do
        ŷ ← argmax over y ∈ Y \ {yi} of 〈w, Ψ(xi, y)〉
        S ← S ∪ {(xi, ŷ)}
    end for
    (w, ξ) ← solution of the QP with only the constraints from S
until S does not change anymore
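A minimal Python sketch of this working-set loop. The oracle most_violated and the restricted-QP solver solve_qp are assumed to be supplied by the caller (they are problem specific, as the next slide discusses), and structured outputs are assumed hashable, e.g. tuples.

```python
import numpy as np

def structured_svm(examples, most_violated, solve_qp, dim, max_iter=1000):
    # examples:      list of (x_i, y_i) pairs
    # most_violated: oracle (w, x, y_true) -> argmax_{y != y_true} <w, Psi(x, y)>
    # solve_qp:      S -> (w, xi), the QP restricted to the constraints indexed by S
    w = np.zeros(dim)
    S = set()                                  # working set of (example index, output) pairs
    for _ in range(max_iter):
        grew = False
        for i, (x_i, y_i) in enumerate(examples):
            y_hat = most_violated(w, x_i, y_i)
            if (i, y_hat) not in S:
                S.add((i, y_hat))
                grew = True
        if not grew:                           # working set unchanged: stop
            break
        w, xi = solve_qp(S)                    # re-optimize over the current working set
    return w
```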


Solving the argmax Problems

Recall that the argmax “comes from the classification stage” and corresponds to finding the most likely output, given a certain model, for a given input.

Problem-specific solutions:

Multi-class classification: trivial, O(M).

Label sequence learning: HMM prediction problem, Viterbi algorithm (dynamic programming), O(|y|²); a sketch follows this list.

Sequence alignment: Smith-Waterman algorithm, O(|x|²).

Learning to parse: CKY parser, O(|x|³ · |G|), where |G| is the size of the grammar.
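As referenced above, a minimal NumPy sketch of Viterbi decoding for the label-sequence argmax. The emission and transition score arrays are illustrative stand-ins for a joint score F(x, y) that decomposes over positions and adjacent label pairs.

```python
import numpy as np

def viterbi(emission, transition):
    # emission:   (T, K) array, emission[t, k] = score of assigning label k at position t
    # transition: (K, K) array, transition[j, k] = score of label j followed by label k
    # Returns the label sequence (length T) with the highest total score.
    T, K = emission.shape
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition      # cand[j, k]: best prefix ending in j, then j -> k
        back[t] = np.argmax(cand, axis=0)              # best previous label for each current label
        score[t] = cand[back[t], np.arange(K)] + emission[t]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                      # trace the back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```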


Complexity Analysis

Theorem 1 (Tsochantaridis et al.)

The Structured Output SVM reaches precision ε after at most

\max\left\{ \frac{2N\Delta}{\epsilon},\ \frac{8C\Delta^3 R^2}{\epsilon^2} \right\}

iterations, where

\Delta = \max_i \max_y \Delta(y, y_i), \qquad R = \max_i \max_y \|\Psi(x_i, y_i) - \Psi(x_i, y)\|


Complexity Analysis

Proof sketch. The main idea is to lower-bound the progress in the objective function by optimizing along the direction corresponding to the added example. It follows that, by re-optimizing the whole working set, at least as much progress can be achieved.

The increase of the objective along some direction η can be bounded as:

\max_{0 < \beta \le D} \{\Theta(\alpha_0 + \beta\eta)\} - \Theta(\alpha_0) \ \ge\ \frac{1}{2} \min\left\{ D,\ \frac{\langle \nabla\Theta(\alpha_0), \eta \rangle}{\eta^{T} J \eta} \right\} \langle \nabla\Theta(\alpha_0), \eta \rangle

By fixing the direction η to be e_r (the unit vector of the added example r), we have:

\max_{0 < \beta \le D} \{\Theta(\alpha_0 + \beta e_r)\} - \Theta(\alpha_0) \ \ge\ \frac{1}{2} \min\left\{ D,\ \frac{\frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)}{J_{rr}} \right\} \frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)

The result follows from substituting specific values for J_{rr} and \frac{\partial\Theta}{\partial\alpha_r}(\alpha_0) and inverting the bound.


Performance Measures in Machine Learning

Confusion matrix:

                    Assigned labels
                     +1      −1
True labels   +1     TP      FN
              −1     FP      TN

Classification error:

E = \frac{FP + FN}{TP + FP + TN + FN}

Neyman-Pearson errors:

FNR = \frac{FN}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}

Precision/Recall:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
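A small sketch computing these quantities from label vectors (a hypothetical helper; labels in ±1, and both classes assumed present so no denominator is zero):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    # y_true, y_pred: arrays of labels in {-1, +1}
    tp = np.sum((y_true == +1) & (y_pred == +1))
    fn = np.sum((y_true == +1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == +1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    return {
        "error": (fp + fn) / (tp + fp + tn + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```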


Performance Measures in Machine Learning (ctd.)

Receiver Operating Characteristic (ROC) and Precision/Recall curves:

Area under Curve (AUC, prAUC): integral measures of ROC or PR curves

F1-measure:

F_1 = \frac{2PR}{P + R}

Precision at k: precision with exactly k positive predictions.


Optimizing Performance Measures

Traditional learning methods optimize the classification error.

Some performance measures are linear, and hence easy to optimize:

⇒ classification error
⇒ Neyman-Pearson errors

Others are nonlinear...

⇒ precision/recall
⇒ F1-measure

...or multivariate

⇒ ROC-based measures depend on the ranking of all data points


Optimizing Performance Measures with SO-SVM

Main idea: instead of learning a single-valued function f : X → Y, consider the problem of learning all labels on all data points:

f : X^N \longrightarrow \{-1, +1\}^N

Define the mapping

\Psi(x, y) = \sum_{i=1}^{N} y_i\, x_i

Learn the linear discriminative function of the structured output SVM:

f_w(x) = \arg\max_{y \in Y} \langle w, \Psi(x, y) \rangle
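A sketch of this multivariate feature map (names are illustrative). Without a loss term the argmax decomposes over points and reduces to the sign of each score; the interesting case is the loss-augmented argmax discussed on the next slide.

```python
import numpy as np

def psi_multivariate(X, y):
    # Psi(x, y) = sum_i y_i * x_i, for X of shape (N, d) and y in {-1,+1}^N
    return X.T @ y

def argmax_plain(w, X):
    # Without a loss term, <w, Psi(X, y)> = sum_i y_i (w . x_i) is maximized pointwise
    # (exactly-zero scores would need an explicit tie-break).
    return np.sign(X @ w)
```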



Algorithmic Implementation

All we need to care about is the argmax problem!

For a problem with N training points there are at most O(N²) different confusion matrices.

⇒ For performance measures based on a confusion matrix, the argmax can be computed explicitly in O(N²) time; a naive sketch follows below.

The number of swapped labels can be computed in O(N log N) by sorting continuous classification scores:

⇒ For performance measures based on a ROC, the argmax can be computed in O(N log N) time.
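A naive sketch of the confusion-matrix argmax referenced above, in the spirit of reference [2], for the loss-augmented problem argmax over y of ∆(y, y_true) + 〈w, Ψ(x, y)〉. Within a fixed (TP, FP) cell the linear score is maximized by predicting +1 on the highest-scoring examples, so it suffices to enumerate the cells. The function name and loss signature are illustrative; as written the inner loop makes this O(N³), and incremental count updates recover the O(N²) bound stated above.

```python
import numpy as np

def loss_augmented_argmax(scores, y_true, delta):
    # scores: (N,) array of w . x_i;  y_true: (N,) labels in {-1, +1}
    # delta(tp, fp, fn, tn) -> loss of the corresponding confusion matrix
    pos = np.flatnonzero(y_true == +1)
    neg = np.flatnonzero(y_true == -1)
    pos = pos[np.argsort(-scores[pos])]        # positives, highest score first
    neg = neg[np.argsort(-scores[neg])]        # negatives, highest score first
    P, Nn = len(pos), len(neg)

    best_val, best_y = -np.inf, None
    for a in range(P + 1):                     # a positives predicted +1: TP = a, FN = P - a
        for b in range(Nn + 1):                # b negatives predicted +1: FP = b, TN = Nn - b
            y = -np.ones_like(scores)
            y[pos[:a]] = +1
            y[neg[:b]] = +1
            val = delta(a, b, P - a, Nn - b) + y @ scores
            if val > best_val:
                best_val, best_y = val, y
    return best_y
```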


Summary

The framework of structured output learning allows one to extend the ideas of kernel methods to a large number of applications.

The algorithmics of structured output learning are based on alternating optimization with applications of the classification function (the argmax problem).

Using structured output learning, one can directly optimize various performance measures for binary classification problems.


Bibliography I

[1] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

[2] T. Joachims. A support vector method for multivariate performance measures. In International Conference on Machine Learning (ICML), pages 377–384, 2005.

[3] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[4] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In European Symposium on Artificial Neural Networks (ESANN), pages 219–224, 1999.

