
Learning from Big Data Lecture 5

M. Pawan Kumar

http://www.robots.ox.ac.uk/~oval/

Slides available online http://mpawankumar.info

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Is this an urban or rural area?

Input: x   Output: y ∈ {-1,+1}

Image Classification

Is this scan healthy or unhealthy?

Input: x   Output: y ∈ {-1,+1}

Image Classification

Observed input: x   Unobserved output: y

Label -1 or Label +1

Probabilistic Graphical Model

Image Classification

Feature Vector

x passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x)

Joint Feature Vector

Input: x   Output: y ∈ {-1,+1}

Ψ(x,y)

Ψ(x,-1) = [Φ(x); 0]

Ψ(x,+1) = [0; Φ(x)]

Score Function

Input: x   Output: y ∈ {-1,+1}

f: Ψ(x,y) → (-∞,+∞)   f(Ψ(x,y)) = wTΨ(x,y)

Prediction

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs
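A minimal sketch (plain Python/NumPy, with a hypothetical feature dimension and hypothetical parameters) of the binary joint feature vector and the argmax prediction rule described above:

```python
import numpy as np

def joint_feature(phi_x, y):
    """Psi(x, y) for y in {-1, +1}: Phi(x) is placed in the block selected by y."""
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x      # Psi(x, -1) = [Phi(x); 0]
    else:
        psi[d:] = phi_x      # Psi(x, +1) = [0; Phi(x)]
    return psi

def predict(w, phi_x):
    """y* = argmax_y w^T Psi(x, y), maximised over the two possible outputs."""
    return max((-1, +1), key=lambda y: w @ joint_feature(phi_x, y))

# Toy usage with a hypothetical 4-dimensional feature Phi(x).
phi_x = np.array([0.2, -1.0, 0.5, 0.3])
w = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # hypothetical parameters
print(predict(w, phi_x))  # prints 1: the +1 block scores 0.2, the -1 block scores 0.0
```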

• Structured Output Prediction– Binary Output– Multi-label Output– Structured Output– Learning

• Structured Output SVM

• Optimization

• Results

Outline

Which city is this?

Input: x   Output: y ∈ {1,2,…,C}

Image Classification

What type of tumor does this scan contain?

Input: x   Output: y ∈ {1,2,…,C}

Image Classification

Observed input: x   Unobserved output: y ∈ {1,2,…,C}

Graphical Model

Image Classification

Feature Vector

x passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x)

Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}

Ψ(x,y)

Ψ(x,1) = [Φ(x); 0; …; 0]

Ψ(x,2) = [0; Φ(x); …; 0]

Ψ(x,C) = [0; …; 0; Φ(x)]
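The same block construction generalises to C classes. A small sketch (dimensions hypothetical) where Ψ(x,y) places Φ(x) in the y-th of C blocks, so that wTΨ(x,y) reduces to a per-class dot product:

```python
import numpy as np

def joint_feature_multiclass(phi_x, y, num_classes):
    """Psi(x, y) for y in {1, ..., C}: Phi(x) in the y-th block, zeros elsewhere."""
    d = phi_x.shape[0]
    psi = np.zeros(num_classes * d)
    psi[(y - 1) * d : y * d] = phi_x
    return psi

# With w partitioned into C blocks w = [w_1; ...; w_C],
# w^T Psi(x, y) equals w_y^T Phi(x), i.e. a standard multi-class linear score.
```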

Where is the object in the image?

Input: x   Output: y ∈ {Pixels}

Object Detection

Where is the rupture in the scan?

Input: x   Output: y ∈ {Pixels}

Object Detection

Observed input: x   Unobserved output: y ∈ {Pixels}

Graphical Model

Object Detection

Joint Feature Vector

x and y passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Ψ(x,y)

Score Function

Input: x   Output: y ∈ {1,2,…,C}

f: Ψ(x,y) → (-∞,+∞)   f(Ψ(x,y)) = wTΨ(x,y)

Prediction

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs

• Structured Output Prediction– Binary Output– Multi-label Output– Structured Output– Learning

• Structured Output SVM

• Optimization

• Results

Outline

What is the semantic class of each pixel?

Input: x   Output: y ∈ {1,2,…,C}^m

(Example labels: car, road, grass, tree, sky)

Segmentation

What is the muscle group of each pixel?

Input: x   Output: y ∈ {1,2,…,C}^m

Segmentation

(Grid graphical model: observed inputs x1,…,x9 with unobserved outputs y1,…,y9, one per pixel)

Graphical Model

Segmentation

Feature Vector

x1 passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x1)

Joint Feature Vector

Input: x1   Output: y1 ∈ {1,2,…,C}

Ψu(x1,1) = [Φ(x1); 0; …; 0]

Ψu(x1,2) = [0; Φ(x1); …; 0]

Ψu(x1,C) = [0; …; 0; Φ(x1)]

Feature Vector

x2 passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x2)

Joint Feature Vector

Input: x2   Output: y2 ∈ {1,2,…,C}

Ψu(x2,1) = [Φ(x2); 0; …; 0]

Ψu(x2,2) = [0; Φ(x2); …; 0]

Ψu(x2,C) = [0; …; 0; Φ(x2)]

Overall Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

Score Function

Input: x   Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞)   f(Ψu(x,y)) = wTΨu(x,y)

Prediction

y* = argmax_y f(Ψu(x,y))

y* = argmax_y wTΨu(x,y)

y* = argmax_y Σa (wa)TΨu(xa,ya)

Maximize for each a ∈ {1,2,…,m} independently
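Because the unary-only score decomposes as Σa (wa)TΨu(xa,ya), the maximisation splits into one small problem per pixel. A sketch of this independent per-pixel argmax (array shapes are hypothetical):

```python
import numpy as np

def predict_unary(block_scores):
    """
    block_scores[a, c] = (w_a)^T Psi_u(x_a, c) for pixel a and label c.
    Since the total score is a sum over pixels, each pixel's label is chosen
    independently by an argmax over its C candidate labels.
    Returns labels in {1, ..., C}, shape (m,).
    """
    return block_scores.argmax(axis=1) + 1

# Example: m = 3 pixels, C = 2 labels.
scores = np.array([[0.1, 0.9],
                   [2.0, -1.0],
                   [0.3, 0.4]])
print(predict_unary(scores))  # [2 1 2]
```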

(Grid graphical model: observed inputs x1,…,x9 with unobserved outputs y1,…,y9)

Graphical Model

Segmentation

Unary Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

(Grid graphical model with pairwise edges between neighbouring pixels)

Pairwise Joint Feature Vector

(Edge between pixels 1 and 2)

Ψp(x12,y12) = δ(y1 = y2)

Pairwise Joint Feature Vector

(Edge between pixels 2 and 3)

Ψp(x23,y23) = δ(y2 = y3)

Pairwise Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]

Pairwise Joint Feature Vector

Overall Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]

Score Function

Input: x   Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞)   f(Ψ(x,y)) = wTΨ(x,y)

Prediction

y* = argmax_y f(Ψ(x,y))

y* = argmax_y wTΨ(x,y)

y* = argmax_y Σa (wa)TΨu(xa,ya) + Σa,b (wab)TΨp(xab,yab)

Week 5 “Optimization” lectures
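As a hedged sketch of how the overall score combines unary and pairwise terms (using the Potts-style pairwise feature δ(ya = yb) from the slides, a single shared pairwise weight, and a hypothetical edge list):

```python
import numpy as np

def overall_score(y, unary_scores, w_pair, edges):
    """
    Score of a labelling y: sum of unary terms plus pairwise terms.
    y:            labelling of the m pixels, shape (m,), labels in {0, ..., C-1}
    unary_scores: shape (m, C), unary_scores[a, c] = (w_a)^T Psi_u(x_a, c)
    w_pair:       weight multiplying the pairwise feature delta(y_a = y_b)
    edges:        list of (a, b) pairs, e.g. the 4-neighbour grid edges
    """
    score = unary_scores[np.arange(len(y)), y].sum()
    score += w_pair * sum(y[a] == y[b] for a, b in edges)
    return score
```

Unlike the unary-only case, maximising this score couples the pixels through the edges, which is exactly the inference problem deferred to the Week 5 optimization lectures.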

Input x, Outputs {y1,y2,…}

Extract Features: Ψ(x,yi)

Compute Scores: f(Ψ(x,yi))

Prediction: y(f) = argmax_yi f(Ψ(x,yi))

How do I fix “f”?

Summary

• Structured Output Prediction– Binary Output– Multi-label Output– Structured Output– Learning

• Structured Output SVM

• Optimization

• Results

Outline

Data distribution P(x,y)

f* = argmin_f E_P(x,y) [ Error(y(f), y) ]

Error: measure of prediction quality, comparing the prediction y(f) to the ground truth y

Expectation over the data distribution

Distribution is unknown

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f E_P(x,y) [ Error(y(f), y) ]

Error: measure of prediction quality (prediction y(f) vs. ground truth y)

Expectation over the data distribution

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σi Error(yi(f), yi)

Expectation over the empirical distribution (finite samples)

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σi Error(yi(f), yi) + λ R(f)

R(f): regularizer   λ: relative weight (hyperparameter)

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σi Error(yi(f), yi) + λ R(f)

Error can be the negative log-likelihood (probabilistic model)

Learning Objective

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004

Score Function and Prediction

Input: x Output: y

Joint feature vector of input and output: Ψ(x,y)

f(Ψ(x,y)) = wTΨ(x,y)

Prediction: max_y wTΨ(x,y)

Predicted Output: y(w) = argmax_y wTΨ(x,y)

Δ(y,y(w))

Loss or risk of prediction given ground-truth

Error Function

Classification loss? User specified

Example: “New York” → 0, “Paris” → 1

Δ(y,y(w)) = δ(y ≠ y(w))

Δ(y,y(w))

Loss or risk of prediction given ground-truth

Error Function

Detection loss? User specified

Overlap score = Area of intersection / Area of union

Δ(y,y(w))

Loss or risk of prediction given ground-truth

Error Function

Segmentation loss? User specified

Fraction of incorrect pixels (micro-average or macro-average)

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w))

Loss function for i-th sample

Minimize the regularized sum of losses over the training data

Highly non-convex in w

Regularization plays no role (overfitting may occur)

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi(w))

≤ wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi)   (since wTΨ(xi,yi(w)) ≥ wTΨ(xi,yi))

≤ max_y { wTΨ(xi,y) + Δ(yi,y) } - wTΨ(xi,yi)

Convex in w, and sensitive to regularization of w

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi ξi

s.t.  wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi   for all y

Learning Objective

Quadratic program with a large number of constraints

Many polynomial time algorithms
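When the output space is small enough to enumerate, the slack for each sample (the largest constraint violation) can be computed directly. A sketch under that assumption, with psi and delta supplied by the caller:

```python
def slack(w, psi, delta, x_i, y_i, outputs):
    """
    xi_i = max over y of  w^T Psi(x_i, y) + Delta(y_i, y) - w^T Psi(x_i, y_i),
    i.e. the smallest slack satisfying all constraints for sample i.
    psi(x, y) returns a NumPy feature vector; delta(y, y') returns the loss.
    """
    base = w @ psi(x_i, y_i)
    return max(w @ psi(x_i, y) + delta(y_i, y) - base for y in outputs)
```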

• Structured Output Prediction

• Structured Output SVM

• Optimization– Stochastic subgradient descent– Conditional gradient aka Frank-Wolfe

• Results

Outline

Shalev-Shwartz et al. Mathematical Programming 2011

Convex function g(z)

Gradient

Gradient s at a point z0:  g(z) - g(z0) ≥ sT(z - z0)

g(z) = z²   Gradient at z0? 2z0

min_z g(z)

Gradient Descent: start at some point z0

g(z) = z²

Move along the negative gradient direction

zt+1 ← zt - λt g’(zt)   Estimate the step-size via line search

Convex function g(z)

Gradient

Gradient s at a point z0:  g(z) - g(z0) ≥ sT(z - z0)

May not exist: for g(z) = |z| at z0 = 0, what is s?

Convex function g(z)

Subgradient

Subgradient s at a point z0:  g(z) - g(z0) ≥ sT(z - z0)

May not be unique: g(z) = |z| at z0 = 0

min_z g(z)

Subgradient Descent: start at some point z0

Move along the negative subgradient direction

zt+1 ← zt - λt g’(zt)   Estimate the step-size via line search

g(z) = |z|

Doesn’t always work

Subgradient Descent

min_z max{z2 + 2z1, z2 - 2z1}

(Figure: level sets g(z) = 3, 4, 5 in the (z1,z2) plane; a step of size λ along the negative of a particular subgradient increases the objective from 5 to 5 + 3λ)


min_z g(z)

Subgradient Descent: start at some point z0

Move along the negative subgradient direction

zt+1 ← zt - λt g’(zt)

Convergence if the step-sizes satisfy  Σt λt = ∞  and  limt→∞ λt = 0

g(z) = |z|
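A tiny illustration of these step-size conditions on the slide's example g(z) = |z|: λt = 1/(t+1) decays to zero but its sum diverges, so the iterates approach the minimizer even though no single subgradient step is guaranteed to decrease g.

```python
def subgradient_descent_abs(z0=5.0, iterations=200):
    """Subgradient descent on g(z) = |z| with step-size lambda_t = 1/(t+1)."""
    z = z0
    for t in range(iterations):
        s = 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)  # a subgradient of |z| at z
        z -= (1.0 / (t + 1)) * s                         # move along the negative subgradient
    return z

print(subgradient_descent_abs())  # close to 0
```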

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi ξi

s.t.  wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi   for all y

Learning Objective

Constrained problem?

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Learning Objective

Subgradient?  g(z) - g(z0) ≥ sT(z - z0)

C Σi max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Subgradient

A subgradient of the i-th max term is  Ψ(xi,ŷ) - Ψ(xi,yi),  where

ŷ = argmax_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Proof?

Since wTΨ(xi,yi) does not depend on y, ŷ can equivalently be computed as

ŷ = argmax_y { wTΨ(xi,y) + Δ(yi,y) }   (Inference)

Classification inference

ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)}

Output: y {1,2,…,C}

Brute-force search

Inference

Detection inference

ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)}

Output: y {1,2,…,C}

Brute-force search

Inference

Segmentation inference:  ŷ = argmax_y { wTΨ(xi,y) + Δ(yi,y) }

max_y  Σa (wa)TΨu(xia,ya) + Σa,b (wab)TΨp(xiab,yab) + Σa Δ(yia,ya)

Week 5 “Optimization” lectures
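When there are no pairwise terms and Δ is the per-pixel (Hamming) loss, the loss-augmented inference above also decomposes over pixels. A minimal sketch under those assumptions:

```python
import numpy as np

def loss_augmented_inference_unary(unary_scores, y_true):
    """
    argmax_y  sum_a (w_a)^T Psi_u(x_a, y_a) + sum_a Delta(y_true_a, y_a),
    with Delta the per-pixel 0-1 loss and no pairwise terms, so each pixel
    is solved independently.
    unary_scores: shape (m, C);  y_true: shape (m,), labels in {0, ..., C-1}.
    """
    m = unary_scores.shape[0]
    augmented = unary_scores + 1.0                 # every label pays loss 1 ...
    augmented[np.arange(m), y_true] -= 1.0         # ... except the ground-truth label
    return augmented.argmax(axis=1)
```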

Subgradient Descent

Start at some parameter w0

For t = 0 to T   // Number of iterations

    s = 2wt

    For i = 1 to n   // Number of samples

        ŷ = argmax_y { wtTΨ(xi,y) + Δ(yi,y) }

        s = s + C (Ψ(xi,ŷ) - Ψ(xi,yi))

    End

    wt+1 = wt - λt s,   λt = 1/(t+1)

End


Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Learning Objective

Choose a sample ‘i’ with probability 1/n

min_w ||w||2 + Cn max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Stochastic Approximation

Expected value? The original objective function

Stochastic Subgradient Descent

Start at some parameter w0

For t = 0 to T   // Number of iterations

    Choose a sample ‘i’ with probability 1/n

    ŷ = argmax_y { wtTΨ(xi,y) + Δ(yi,y) }

    s = 2wt + Cn (Ψ(xi,ŷ) - Ψ(xi,yi))

    wt+1 = wt - λt s,   λt = 1/(t+1)

End
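Putting the pieces together, a hedged sketch of the stochastic subgradient update (the caller supplies psi and the loss-augmented inference routine; the update subtracts the subgradient because the objective is minimised):

```python
import numpy as np

def ssvm_stochastic_subgradient(data, psi, loss_aug_argmax, dim, C=1.0, T=1000, seed=0):
    """
    data: list of (x_i, y_i) pairs; psi(x, y) -> feature vector of length dim;
    loss_aug_argmax(w, x_i, y_i) -> argmax_y { w^T Psi(x_i, y) + Delta(y_i, y) }.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    w = np.zeros(dim)
    for t in range(T):
        x_i, y_i = data[rng.integers(n)]              # pick sample i with probability 1/n
        y_hat = loss_aug_argmax(w, x_i, y_i)          # loss-augmented inference
        s = 2 * w + C * n * (psi(x_i, y_hat) - psi(x_i, y_i))   # stochastic subgradient
        w = w - (1.0 / (t + 1)) * s                   # step along the negative subgradient
    return w
```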

Convergence Rate

Compute an ε-optimal solution

C: SSVM hyperparameter

d: Number of non-zeros in the feature vector

O(dC/ε) iterations

Each iteration requires solving an inference problem

Side Note: Structured Output CNN

conv1 conv2 conv3 conv4 conv5 fc6 fc7 → SSVM

Back-propagate the subgradients

• Structured Output Prediction

• Structured Output SVM

• Optimization– Stochastic subgradient descent– Conditional gradient aka Frank-Wolfe

• Results

Outline

Lacoste-Julien et al. ICML 2013

Conditional Gradient

(Illustrative figures; slides courtesy Martin Jaggi)

SSVM Primal

min_w ||w||2 + C Σi ξi

s.t.  wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi   for all y

Derive dual on board

SSVM Dual

max_α  - ||Mα||2/4 + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

where w = Mα/2 and b is the vector with entries Δ(yi,y)

Linear Program

max_α  (Mα)Twt + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

Standard Frank-Wolfe: solve this over all possible α

Block Coordinate Frank-Wolfe: solve this over αi for a single sample ‘i’

Linear Program

max_α  (Mα)Twt + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

Vertices?  αi(y) = C if y = ŷ, 0 otherwise

Solution

max_α  (Mα)Twt + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

ŷ = argmax_y { wtTΨ(xi,y) + Δ(yi,y) }   (Inference)

si(y) = C if y = ŷ, 0 otherwise

Which vertex maximizes the linear function?

Update

αt+1 = (1-μ) αt + μ s

Standard Frank-Wolfe: s contains the solution for all the samples

Block Coordinate Frank-Wolfe: s contains the solution for sample ‘i’, and sj = αtj for all other samples j

Step-Size

αt+1 = (1-μ) αt + μs

Maximizing a quadratic function in one variable μ

Analytical computation of optimal step-size
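With the dual written as max_α bTα - ||Mα||²/4, the line search along the Frank-Wolfe direction d = s - αt maximises a concave quadratic in μ and has a closed form. A sketch, treating M as an explicit matrix purely for illustration (in practice Mα is maintained implicitly through w = Mα/2):

```python
import numpy as np

def optimal_step_size(M, b, alpha, s):
    """
    Maximise g(alpha + mu * d) for g(alpha) = b^T alpha - ||M alpha||^2 / 4
    and d = s - alpha, with mu restricted to [0, 1].
    """
    d = s - alpha
    Md = M @ d
    slope = b @ d - 0.5 * (M @ alpha) @ Md   # derivative of g along d at mu = 0
    curvature = 0.5 * (Md @ Md)              # magnitude of the (negative) curvature along d
    if curvature <= 0.0:
        return 1.0 if slope > 0.0 else 0.0   # g is linear along d
    return float(np.clip(slope / curvature, 0.0, 1.0))
```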

Comparison

OCR Dataset

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results– Exact Inference– Approximate Inference– Choice of Loss Function

Outline

Optical Character Recognition

Identify each letter in a handwritten word

Taskar, Guestrin and Koller, NIPS 2003

(Chain model over letters X1 X2 X3 X4;  Labels L = {a, b, …, z})

(Bar chart comparing Logistic Regression and Multi-Class SVM)

(Bar chart comparing Maximum Likelihood and Structured Output SVM)

Image Segmentation

Szummer, Kohli and Hoiem, ECCV 2006

(Grid model over pixels X1,…,X9;  Labels L = {0, 1})

(Bar chart comparing Unary, Max Likelihood and SSVM)

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results– Exact Inference– Approximate Inference– Choice of Loss Function

Outline

Scene Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

Reuters Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

Yeast Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

Mediamill Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results– Exact Inference– Approximate Inference– Choice of Loss Function

Outline

“Jumping” Classification

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …, n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Threshold the score for classification

“Jumping” Ranking

(Rank 1 to Rank 6: example ranking with Average Precision = 1)

Ranking vs. Classification

(Rank 1 to Rank 6: example rankings showing that Average Precision and Accuracy can disagree; the figure lists AP values 1, 0.92 and 0.81 and Accuracy values 1 and 0.67)

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …, n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Sort the scores to obtain a ranking
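For reference, a small sketch of how Average Precision is computed from the sorted scores (AP is the mean of precision-at-k taken over the positions of the positive samples):

```python
import numpy as np

def average_precision(scores, labels):
    """labels are 0/1; AP = mean over positives of the precision at their rank."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # sort by decreasing score
    ranked = np.asarray(labels)[order]
    precision_at_k = np.cumsum(ranked) / (np.arange(len(ranked)) + 1)
    return precision_at_k[ranked == 1].mean()

# Example: positives ranked 1, 2, 3 and 5 out of 6 -> AP = (1 + 1 + 1 + 4/5) / 4 = 0.95
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 0, 1, 0]))
```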

Computes subgradients of the AP loss

Yue, Finley, Radlinski and Joachims, SIGIR 2007

(Bar chart, Training Time: training with the AP loss is about 5x slower than with the 0-1 loss)

(Bar chart, Average Precision: the AP loss gives roughly a 4% improvement over the 0-1 loss, for free)

Efficient Optimization of Average Precision

Pritish Mohapatra, C. V. Jawahar, M. Pawan Kumar

(Bar chart, Training Time: 0-1 loss, AP loss about 5x slower, and the proposed AP method slightly faster)

Each iteration of AP optimization is slightly slower

It takes fewer iterations to converge in practice

Questions?
