Page 1:

Learning from Big Data Lecture 5

M. Pawan Kumar

http://www.robots.ox.ac.uk/~oval/

Slides available online http://mpawankumar.info

Page 2:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Page 3:

Is this an urban or rural area?

Input: x Output: y ∈ {-1,+1}

Image Classification

Page 4:

Is this scan healthy or unhealthy?

Input: x Output: y ∈ {-1,+1}

Image Classification

Page 5:

Probabilistic Graphical Model: observed input x, unobserved output y taking Label -1 or Label +1.

Image Classification

Page 6:

Feature Vector

x → Feature Φ(x)

Page 7:

Feature Vector

Pre-Trained CNN: x → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x)

Page 8:

Joint Feature Vector

Input: x Output: y ∈ {-1,+1}

Ψ(x,y)

Page 9:

Joint Feature Vector

Input: x Output: y ∈ {-1,+1}

Ψ(x,-1) = [Φ(x); 0]

Page 10:

Joint Feature Vector

Input: x Output: y ∈ {-1,+1}

Ψ(x,+1) = [0; Φ(x)]

Page 11:

Score Function

Input: x Output: y ∈ {-1,+1}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

Page 12:

Prediction

Input: x Output: y ∈ {-1,+1}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs
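As a concrete illustration, here is a minimal numpy sketch of this binary construction; phi_x stands for a precomputed feature Φ(x) (e.g. an fc7 activation), and the function names are mine, not from the slides.

import numpy as np

def joint_feature(phi_x, y):
    # Psi(x,-1) = [Phi(x); 0] and Psi(x,+1) = [0; Phi(x)]
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x
    else:
        psi[d:] = phi_x
    return psi

def predict(w, phi_x):
    # y* = argmax_y w^T Psi(x,y), maximized over y in {-1,+1}
    scores = {y: w @ joint_feature(phi_x, y) for y in (-1, +1)}
    return max(scores, key=scores.get)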

Page 13:

• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning

• Structured Output SVM

• Optimization

• Results

Outline

Page 14:

Which city is this?

Input: x Output: y ∈ {1,2,…,C}

Image Classification

Page 15:

What type of tumor does this scan contain?

Input: x Output: y ∈ {1,2,…,C}

Image Classification

Page 16:

Graphical Model: observed input x, unobserved output y taking one of the labels 1, 2, 3, …, C.

Image Classification

Page 17:

Feature Vector

Pre-Trained CNN: x → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x)

Page 18:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,y)

Page 19:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,1) = [Φ(x); 0; …; 0]

Page 20:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,2) = [0; Φ(x); …; 0]

Page 21:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,C) = [0; 0; …; Φ(x)]

Page 22:

Where is the object in the image?

Input: x Output: y ∈ {Pixels}

Object Detection

Page 23:

Where is the rupture in the scan?

Input: x Output: y ∈ {Pixels}

Object Detection

Page 24:

Graphical Model: observed input x, unobserved output y taking one of the values 1, 2, 3, …, C.

Object Detection

Page 25:

Joint Feature Vector

Pre-Trained CNN applied to the pair (x, y): x, y → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Ψ(x,y)


Page 28:

Score Function

Input: x Output: y ∈ {1,2,…,C}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

Page 29:

Prediction

Input: x Output: y ∈ {1,2,…,C}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs
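The C-class case is the same construction with C blocks. A minimal sketch, assuming labels 1,…,C and a precomputed feature phi_x (the names are mine):

import numpy as np

def joint_feature_multiclass(phi_x, y, C):
    # Psi(x,y): Phi(x) placed in the y-th of C blocks, zeros elsewhere
    d = phi_x.shape[0]
    psi = np.zeros(C * d)
    psi[(y - 1) * d : y * d] = phi_x
    return psi

def predict_multiclass(w, phi_x, C):
    # y* = argmax_y w^T Psi(x,y); with this block structure the score of
    # class y is the dot product of Phi(x) with the y-th block of w
    d = phi_x.shape[0]
    scores = [w[(y - 1) * d : y * d] @ phi_x for y in range(1, C + 1)]
    return int(np.argmax(scores)) + 1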

Page 30:

• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning

• Structured Output SVM

• Optimization

• Results

Outline

Page 31:

What is the semantic class of each pixel?

Input: x Output: y ∈ {1,2,…,C}^m

[Example labeling: car, road, grass, tree, sky.]

Segmentation

Page 32:

What is the muscle group of each pixel?

Input: x Output: y ∈ {1,2,…,C}^m

Segmentation

Page 33:

Graphical Model: a grid of observed inputs x1, …, x9 with corresponding unobserved outputs y1, …, y9.

Segmentation

Page 34:

Feature Vector

Pre-Trained CNN: x1 → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x1)

Page 35:

Joint Feature Vector

Input: x1 Output: y1 ∈ {1,2,…,C}

Ψu(x1,1) = [Φ(x1); 0; …; 0]

Page 36:

Joint Feature Vector

Input: x1 Output: y1 ∈ {1,2,…,C}

Ψu(x1,2) = [0; Φ(x1); …; 0]

Page 37:

Joint Feature Vector

Input: x1 Output: y1 ∈ {1,2,…,C}

Ψu(x1,C) = [0; 0; …; Φ(x1)]

Page 38:

Feature Vector

Pre-Trained CNN: x2 → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x2)

Page 39:

Joint Feature Vector

Input: x2 Output: y2 ∈ {1,2,…,C}

Ψu(x2,1) = [Φ(x2); 0; …; 0]

Page 40:

Joint Feature Vector

Input: x2 Output: y2 ∈ {1,2,…,C}

Ψu(x2,2) = [0; Φ(x2); …; 0]

Page 41:

Joint Feature Vector

Input: x2 Output: y2 ∈ {1,2,…,C}

Ψu(x2,C) = [0; 0; …; Φ(x2)]

Page 42:

Overall Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

Page 43:

Score Function

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

Page 44:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

y* = argmax_y f(Ψu(x,y))

Page 45:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

y* = argmax_y wᵀΨu(x,y)

Page 46:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

y* = argmax_y Σ_a (wa)ᵀΨu(xa,ya)

Maximize for each a ∈ {1,2,…,m} independently
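Since the unary score decomposes over pixels, prediction reduces to m independent maximizations. A sketch, assuming the per-pixel features are stacked in an (m, d) array and W holds one weight block per label (both names are mine):

import numpy as np

def predict_unary(W, phi):
    # W: (C, d) label weight blocks; phi: (m, d) per-pixel features Phi(x_a)
    scores = phi @ W.T                 # (m, C): score of every label at every pixel
    return scores.argmax(axis=1) + 1   # pick each y_a independently; labels 1,...,C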

Page 47:

Graphical Model: a grid of observed inputs x1, …, x9 with corresponding unobserved outputs y1, …, y9.

Segmentation

Page 48:

Unary Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

Page 49:

Pairwise Joint Feature Vector

[Grid graphical model with edges between neighbouring outputs y1, …, y9.]

Page 50:

Pairwise Joint Feature Vector

[Grid graphical model; the edge between y1 and y2 is highlighted.]

Ψp(x12,y12) = δ(y1 = y2)

Page 51:

Pairwise Joint Feature Vector

[Grid graphical model; the edge between y2 and y3 is highlighted.]

Ψp(x23,y23) = δ(y2 = y3)

Page 52:

Input: x Output: y ∈ {1,2,…,C}^m

Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]

Pairwise Joint Feature Vector

Page 53:

Overall Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}^m

Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]

Page 54:

Score Function

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

Page 55:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y f(Ψ(x,y))

Page 56:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y wᵀΨ(x,y)

Page 57:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y Σ_a (wa)ᵀΨu(xa,ya) + Σ_{a,b} (wab)ᵀΨp(xab,yab)

Week 5 “Optimization” lectures
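To make the decomposed score concrete, here is a sketch that evaluates (not maximizes) the unary + pairwise score of a given labeling; unary, w_pair, and edges are my own simplifications (a single Potts weight rather than per-edge weight blocks):

def score_labeling(unary, w_pair, y, edges):
    # unary[a][k]: precomputed unary score (w_k)^T Psi_u(x_a, k) of label k at pixel a
    # edges: list of neighbouring pixel pairs (a, b); y: labeling with y[a] in 0..C-1
    s = sum(unary[a][y[a]] for a in range(len(y)))
    s += w_pair * sum(y[a] == y[b] for (a, b) in edges)
    return s

Maximizing this score over all C^m labelings is the inference problem; brute force is exponential, hence the specialised algorithms of the “Optimization” lectures.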

Page 58:

Summary

Input x, outputs {y1, y2, …} → extract features Ψ(x,yi) → compute scores f(Ψ(x,yi)) → prediction y(f) = argmax_{yi} f(Ψ(x,yi))

How do I fix “f”?

Page 59:

• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning

• Structured Output SVM

• Optimization

• Results

Outline

Page 60:

Learning Objective

Data distribution P(x,y)

f* = argmin_f E_{P(x,y)} [Error(y(f), y)]

Error measures the quality of the prediction y(f) against the ground truth y; the expectation is over the data distribution, which is unknown.

Page 61:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f E_{P(x,y)} [Error(y(f), y)]

Error measures the quality of the prediction against the ground truth; the expectation is over the data distribution.

Page 62:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σ_i Error(yi(f), yi)

Expectation over the empirical distribution given by the finite samples.

Page 63:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σ_i Error(yi(f), yi) + λ R(f)

R(f) is a regularizer (to compensate for the finite samples); λ is its relative weight, a hyperparameter.

Page 64:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σ_i Error(yi(f), yi) + λ R(f)

Error can be the negative log-likelihood, which gives a probabilistic model.

Page 65:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004

Page 66:

Score Function and Prediction

Input: x Output: y

Joint feature vector of input and output: Ψ(x,y)

f(Ψ(x,y)) = wᵀΨ(x,y)

Prediction: max_y wᵀΨ(x,y)

Predicted Output: y(w) = argmax_y wᵀΨ(x,y)

Page 67:

Error Function

Δ(y,y(w)): user-specified loss or risk of the prediction given the ground truth.

Classification loss? Δ(y,y(w)) = δ(y ≠ y(w))

Example with ground truth “New York”: predicting “New York” costs 0, predicting “Paris” costs 1.

Page 68:

Error Function

Δ(y,y(w)): user-specified loss or risk of the prediction given the ground truth.

Detection loss? Based on the overlap score = (area of intersection) / (area of union).

Page 69:

Error Function

Δ(y,y(w)): user-specified loss or risk of the prediction given the ground truth.

Segmentation loss? Fraction of incorrect pixels, as a micro-average or a macro-average.

Page 70:

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w))

Loss function for i-th sample

Minimize the regularized sum of loss over training data

Highly non-convex in w

Regularization plays no role (overfitting may occur)

Learning Objective

Page 71:

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = wᵀΨ(xi,yi(w)) + Δ(yi,yi(w)) - wᵀΨ(xi,yi(w))

≤ wᵀΨ(xi,yi(w)) + Δ(yi,yi(w)) - wᵀΨ(xi,yi), since yi(w) maximizes the score: wᵀΨ(xi,yi(w)) ≥ wᵀΨ(xi,yi)

≤ max_y { wᵀΨ(xi,y) + Δ(yi,y) } - wᵀΨ(xi,yi)

This upper bound is convex in w and sensitive to the regularization of w.

Learning Objective

Page 72:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i ξi

s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y

A quadratic program with a large number of constraints; many polynomial-time algorithms exist.
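Equivalently, eliminating ξi gives the unconstrained objective ||w||² + C Σ_i max_y {…}. A sketch that evaluates this objective by brute force over a finite label set (the names are mine; w and the joint features are numpy arrays, data is a list of (x, y_true) pairs):

def ssvm_objective(w, data, labels, joint_feature, loss, C):
    # ||w||^2 + C * sum_i max_y {w.Psi(x,y) + Delta(y_i,y) - w.Psi(x,y_i)}
    obj = float(w @ w)
    for x, y_true in data:
        obj += C * max(w @ joint_feature(x, y) + loss(y_true, y)
                       - w @ joint_feature(x, y_true) for y in labels)
    return obj

Note the inner max is never negative, since y = y_true contributes Δ(yi,yi) = 0.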

Page 73:

• Structured Output Prediction

• Structured Output SVM

• Optimization
  – Stochastic subgradient descent
  – Conditional gradient, aka Frank-Wolfe

• Results

Outline

Shalev-Shwartz et al. Mathematical Programming 2011

Page 74:

Gradient

Convex function g(z)

A gradient s at a point z0 satisfies g(z) - g(z0) ≥ sᵀ(z - z0)

Example: g(z) = z². Gradient? 2z0

Page 75:

Gradient Descent

min_z g(z), e.g. g(z) = z²

Start at some point z0 and move along the negative gradient direction:

z_{t+1} ← z_t - λ_t g'(z_t), with the step-size λ_t estimated via line search.

Page 76:

Gradient

Convex function g(z)

A gradient s at a point z0 satisfies g(z) - g(z0) ≥ sᵀ(z - z0)

May not exist: for g(z) = |z|, what is s at z0 = 0?

Page 77:

Subgradient

Convex function g(z)

A subgradient s at a point z0 satisfies g(z) - g(z0) ≥ sᵀ(z - z0)

May not be unique: for g(z) = |z| at z0 = 0, any s ∈ [-1,1] is a subgradient.

Page 78:

Subgradient Descent

min_z g(z), e.g. g(z) = |z|

Start at some point z0 and move along the negative subgradient direction:

z_{t+1} ← z_t - λ_t g'(z_t), with the step-size estimated via line search.

Doesn't always work.

Page 79:

Subgradient Descent

min_z max{z2 + 2z1, z2 - 2z1}

[Contour plot with level sets g(z) = 3, 4, 5, illustrating that a subgradient step chosen by line search can fail to decrease the objective.]


Page 81:

Subgradient Descent

min_z g(z), e.g. g(z) = |z|

Start at some point z0 and move along the negative subgradient direction:

z_{t+1} ← z_t - λ_t g'(z_t)

Convergence requires lim_{T→∞} Σ_{t=1}^T λ_t = ∞ and lim_{t→∞} λ_t = 0.

Page 82:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i ξi

s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y

Learning Objective

Constrained problem?

Page 83:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Learning Objective

Subgradient? Recall g(z) - g(z0) ≥ sᵀ(z - z0)

Page 84:

Subgradient

C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Ψ(xi,y) - Ψ(xi,yi)

Page 85:

Subgradient

C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Subgradient: Ψ(xi,ŷ) - Ψ(xi,yi), where ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Proof?

Page 86:

Subgradient

C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Subgradient: Ψ(xi,ŷ) - Ψ(xi,yi), where ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) } (the term wᵀΨ(xi,yi) is constant in y, so it drops out of the argmax). This is the inference problem.
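A sketch of this loss-augmented inference by brute force over a finite label set (the function names are mine; w and the joint features are numpy arrays):

def loss_augmented_inference(w, x, y_true, labels, joint_feature, loss):
    # y_hat = argmax_y w^T Psi(x,y) + Delta(y_true, y)
    return max(labels,
               key=lambda y: w @ joint_feature(x, y) + loss(y_true, y))

def subgradient_term(w, x, y_true, labels, joint_feature, loss):
    # per-sample subgradient Psi(x, y_hat) - Psi(x, y_true)
    y_hat = loss_augmented_inference(w, x, y_true, labels, joint_feature, loss)
    return joint_feature(x, y_hat) - joint_feature(x, y_true)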

Page 87:

Inference

Classification inference: ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }

Output: y ∈ {1,2,…,C}

Brute-force search

Page 88:

Inference

Detection inference: ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }

Output: y ∈ {Pixels}

Brute-force search

Page 89:

Inference

Segmentation inference: ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }

= argmax_y Σ_a (wa)ᵀΨu(xia,ya) + Σ_{a,b} (wab)ᵀΨp(xiab,yab) + Σ_a Δ(yia,ya)

Week 5 “Optimization” lectures

Page 90:

Subgradient Descent

Start at some parameter w0

For t = 0 to T            // Number of iterations
    s = 2 w_t
    For i = 1 to n        // Number of samples
        ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }
        s = s + C (Ψ(xi,ŷ) - Ψ(xi,yi))
    End
    w_{t+1} = w_t - λ_t s, with λ_t = 1/(t+1)
End
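A runnable sketch of this loop, under the same brute-force inference assumption as above (dim is the dimension of Ψ; all names are mine):

import numpy as np

def ssvm_subgradient_descent(data, labels, joint_feature, loss, C, T, dim):
    # minimizes ||w||^2 + C sum_i max_y {w.Psi(xi,y) + Delta(yi,y) - w.Psi(xi,yi)}
    w = np.zeros(dim)
    for t in range(T):
        s = 2.0 * w                                  # subgradient of ||w||^2
        for x, y_true in data:
            y_hat = max(labels, key=lambda y: w @ joint_feature(x, y)
                                              + loss(y_true, y))
            s += C * (joint_feature(x, y_hat) - joint_feature(x, y_true))
        w = w - s / (t + 1)                          # step size 1/(t+1)
    return w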


Page 92:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Learning Objective

Page 93:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Stochastic Approximation

Choose a sample ‘i’ with probability 1/n

Page 94:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C n max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Stochastic Approximation

Choose a sample ‘i’ with probability 1/n

Expected value? Original objective function

Page 95:

Stochastic Subgradient Descent

Start at some parameter w0

For t = 0 to T            // Number of iterations
    Choose a sample ‘i’ with probability 1/n
    s = 2 w_t
    ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }
    s = s + C n (Ψ(xi,ŷ) - Ψ(xi,yi))
    w_{t+1} = w_t - λ_t s, with λ_t = 1/(t+1)
End
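The stochastic variant in the same sketch style: one sampled example per iteration gives an unbiased estimate of the batch subgradient (names are mine):

import numpy as np

def ssvm_stochastic_subgradient(data, labels, joint_feature, loss, C, T, dim, seed=0):
    rng = np.random.default_rng(seed)
    n, w = len(data), np.zeros(dim)
    for t in range(T):
        x, y_true = data[rng.integers(n)]        # sample i with probability 1/n
        y_hat = max(labels, key=lambda y: w @ joint_feature(x, y)
                                          + loss(y_true, y))
        s = 2.0 * w + C * n * (joint_feature(x, y_hat)
                               - joint_feature(x, y_true))
        w = w - s / (t + 1)                      # decaying step size 1/(t+1)
    return w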

Page 96:

Convergence Rate

Compute an ε-optimal solution

C: SSVM hyperparameter

d: Number of non-zeros in the feature vector

O(dC/ε) iterations

Each iteration requires solving an inference problem

Page 97:

Side Note: Structured Output CNN

conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → SSVM

Back-propagate the subgradients

Page 98:

• Structured Output Prediction

• Structured Output SVM

• Optimization
  – Stochastic subgradient descent
  – Conditional gradient, aka Frank-Wolfe

• Results

Outline

Lacoste-Julien et al. ICML 2013

Page 99:

Slide courtesy Martin Jaggi

Conditional Gradient


Page 103:

SSVM Primal

min_w ||w||² + C Σ_i ξi

s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y

Derive dual on board

Page 104:

SSVM Dual

max_α -||Mα||²/4 + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

w = Mα/2, b = [Δ(yi,y)]

Page 105:

Linear Program

max_α -(Mα)ᵀw_t + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

Solve this over all possible α: Standard Frank-Wolfe

Solve this over all possible αi for a sample ‘i’: Block Coordinate Frank-Wolfe

Page 106:

Linear Program

max_α -(Mα)ᵀw_t + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

Vertices? αi(y) = C if y = ŷ, and 0 otherwise

Page 107:

Solution

max_α -(Mα)ᵀw_t + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

Which vertex maximizes the linear function?

si(y) = C if y = ŷ, and 0 otherwise, where ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) } (inference)

Page 108:

Update

α_{t+1} = (1-μ) α_t + μ s

Standard Frank-Wolfe: s contains the solution for all the samples

Block Coordinate Frank-Wolfe: s contains the solution for sample ‘i’, and sj = α_tj for all other samples

Page 109:

Step-Size

α_{t+1} = (1-μ) α_t + μ s

Along this line the dual objective is a quadratic function of the single variable μ, so the optimal step-size can be computed analytically.
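A sketch of that analytic step: along the Frank-Wolfe direction the dual objective has the form h(μ) = aμ² + bμ + c with a < 0, and the maximizer over [0,1] is the unconstrained optimum clipped to the interval (a and b here are the coefficients of that quadratic, not the dual's b vector):

def optimal_step_size(a, b):
    # maximize h(mu) = a*mu^2 + b*mu + c over mu in [0, 1], assuming a < 0
    mu = -b / (2.0 * a)
    return min(max(mu, 0.0), 1.0)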

Page 110:

Comparison

OCR Dataset

Page 111:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function

Outline

Page 112:

Optical Character Recognition

Identify each letter in a handwritten word

Taskar, Guestrin and Koller, NIPS 2003

Page 113:

Optical Character Recognition

Chain of letter variables X1, X2, X3, X4; Labels L = {a, b, …, z}

[Bar chart comparing Logistic Regression and Multi-Class SVM.]

Taskar, Guestrin and Koller, NIPS 2003

Page 114:

Optical Character Recognition

Chain of letter variables X1, X2, X3, X4; Labels L = {a, b, …, z}

[Bar chart comparing Maximum Likelihood and Structured Output SVM.]

Taskar, Guestrin and Koller, NIPS 2003

Page 115:

Optical Character Recognition

Taskar, Guestrin and Koller, NIPS 2003

Page 116:

Image Segmentation

Szummer, Kohli and Hoiem, ECCV 2008

Page 117:

Image Segmentation

Grid of pixel variables X1, …, X9; Labels L = {0, 1}

Szummer, Kohli and Hoiem, ECCV 2008

Page 118:

Image Segmentation

[Bar chart comparing Unary, Max Likelihood, and SSVM; y-axis from 0 to 25.]

Szummer, Kohli and Hoiem, ECCV 2008

Page 119:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function

Outline

Page 120:

Scene Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 9.6 to 11.4.]

Finley and Joachims, ICML 2008

Page 121:

Reuters Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 0 to 18.]

Finley and Joachims, ICML 2008

Page 122:

Yeast Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 0 to 50.]

Finley and Joachims, ICML 2008

Page 123:

Mediamill Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 0 to 40.]

Finley and Joachims, ICML 2008

Page 124:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function

Outline

Page 125:

“Jumping” Classification

Page 126:

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …., n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Threshold the score for classification

Page 127:

“Jumping” Ranking

[Six images ranked 1 to 6.]

Average Precision = 1

Page 128:

Ranking vs. Classification

[Rankings of the six images compared: Average Precision = 1, 0.92, 0.81; Accuracy = 1, 0.67.]
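Average precision is the mean of precision@k over the ranks k that hold positive items, which is why it is sensitive to the ordering even when accuracy is not. A small worked sketch:

def average_precision(ranked_labels):
    # ranked_labels: 1 for a positive item, 0 for a negative, in ranked order
    hits, precisions = 0, []
    for k, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

print(average_precision([1, 1, 1, 0, 0, 0]))  # perfect ranking: 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))  # (1 + 1 + 3/4) / 3 ≈ 0.92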

Page 129:

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …., n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Sort the score for ranking

Page 130:

Computes subgradients of the AP loss

Page 131:

[Bar charts: Training Time, 0-1 loss vs. AP loss (AP is 5x slower); Average Precision, 0-1 loss vs. AP loss (4% improvement for free).]

Yue, Finley, Radlinski and Joachims, SIGIR 2007

Page 132:

Efficient Optimization of Average Precision

Pritish Mohapatra C. V. Jawahar M. Pawan Kumar

Page 133:

[Bar chart: Training Time for 0-1 loss, AP loss (5x slower), and efficient AP (slightly faster).]

Each iteration of AP optimization is slightly slower, but it takes fewer iterations to converge in practice.

Page 134:

Questions?