Page 1:

Learning from Big Data Lecture 5

M. Pawan Kumar

http://www.robots.ox.ac.uk/~oval/

Slides available online http://mpawankumar.info

Page 2:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Page 3:

Is this an urban or rural area?

Input: x Output: y ∈ {-1,+1}

Image Classification

Page 4:

Is this scan healthy or unhealthy?

Input: x Output: y ∈ {-1,+1}

Image Classification

Page 5:

Probabilistic Graphical Model: observed input x, unobserved output y taking Label -1 or Label +1.

Image Classification

Page 6:

Feature Vector

x → Feature Φ(x)

Page 7:

Feature Vector

Pre-Trained CNN: x → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x)

Page 8:

Joint Feature Vector

Input: x Output: y ∈ {-1,+1}

Ψ(x,y)

Page 9:

Joint Feature Vector

Input: x Output: y ∈ {-1,+1}

Ψ(x,-1) = [Φ(x); 0]

Page 10:

Joint Feature Vector

Input: x Output: y ∈ {-1,+1}

Ψ(x,+1) = [0; Φ(x)]

Page 11:

Score Function

Input: x Output: y ∈ {-1,+1}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

Page 12:

Prediction

Input: x Output: y ∈ {-1,+1}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs
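As a concrete illustration, here is a minimal numpy sketch of this binary construction; phi_x stands for a precomputed feature Φ(x) (e.g. an fc7 activation), and the function names are mine, not from the slides.

import numpy as np

def joint_feature(phi_x, y):
    # Psi(x,-1) = [Phi(x); 0] and Psi(x,+1) = [0; Phi(x)]
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x
    else:
        psi[d:] = phi_x
    return psi

def predict(w, phi_x):
    # y* = argmax_y w^T Psi(x,y), maximized over y in {-1,+1}
    scores = {y: w @ joint_feature(phi_x, y) for y in (-1, +1)}
    return max(scores, key=scores.get)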

Page 13:

• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning

• Structured Output SVM

• Optimization

• Results

Outline

Page 14:

Which city is this?

Input: x Output: y ∈ {1,2,…,C}

Image Classification

Page 15:

What type of tumor does this scan contain?

Input: x Output: y ∈ {1,2,…,C}

Image Classification

Page 16:

Graphical Model: observed input x, unobserved output y taking one of the labels 1, 2, 3, …, C.

Image Classification

Page 17:

Feature Vector

Pre-Trained CNN: x → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x)

Page 18:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,y)

Page 19:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,1) = [Φ(x); 0; …; 0]

Page 20:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,2) = [0; Φ(x); …; 0]

Page 21:

Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}

Ψ(x,C) = [0; 0; …; Φ(x)]

Page 22:

Where is the object in the image?

Input: x Output: y ∈ {Pixels}

Object Detection

Page 23:

Where is the rupture in the scan?

Input: x Output: y ∈ {Pixels}

Object Detection

Page 24:

Graphical Model: observed input x, unobserved output y taking one of the values 1, 2, 3, …, C.

Object Detection

Page 25:

Joint Feature Vector

Pre-Trained CNN applied to the pair (x, y): x, y → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Ψ(x,y)


Page 28:

Score Function

Input: x Output: y ∈ {1,2,…,C}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

Page 29:

Prediction

Input: x Output: y ∈ {1,2,…,C}

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs
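The C-class case is the same construction with C blocks. A minimal sketch, assuming labels 1,…,C and a precomputed feature phi_x (the names are mine):

import numpy as np

def joint_feature_multiclass(phi_x, y, C):
    # Psi(x,y): Phi(x) placed in the y-th of C blocks, zeros elsewhere
    d = phi_x.shape[0]
    psi = np.zeros(C * d)
    psi[(y - 1) * d : y * d] = phi_x
    return psi

def predict_multiclass(w, phi_x, C):
    # y* = argmax_y w^T Psi(x,y); with this block structure the score of
    # class y is the dot product of Phi(x) with the y-th block of w
    d = phi_x.shape[0]
    scores = [w[(y - 1) * d : y * d] @ phi_x for y in range(1, C + 1)]
    return int(np.argmax(scores)) + 1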

Page 30:

• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning

• Structured Output SVM

• Optimization

• Results

Outline

Page 31:

What is the semantic class of each pixel?

Input: x Output: y ∈ {1,2,…,C}^m

[Example labeling: car, road, grass, tree, sky.]

Segmentation

Page 32:

What is the muscle group of each pixel?

Input: x Output: y ∈ {1,2,…,C}^m

Segmentation

Page 33:

Graphical Model: a grid of observed inputs x1, …, x9 with corresponding unobserved outputs y1, …, y9.

Segmentation

Page 34:

Feature Vector

Pre-Trained CNN: x1 → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x1)

Page 35:

Joint Feature Vector

Input: x1 Output: y1 ∈ {1,2,…,C}

Ψu(x1,1) = [Φ(x1); 0; …; 0]

Page 36:

Joint Feature Vector

Input: x1 Output: y1 ∈ {1,2,…,C}

Ψu(x1,2) = [0; Φ(x1); …; 0]

Page 37:

Joint Feature Vector

Input: x1 Output: y1 ∈ {1,2,…,C}

Ψu(x1,C) = [0; 0; …; Φ(x1)]

Page 38:

Feature Vector

Pre-Trained CNN: x2 → conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → Feature Φ(x2)

Page 39:

Joint Feature Vector

Input: x2 Output: y2 ∈ {1,2,…,C}

Ψu(x2,1) = [Φ(x2); 0; …; 0]

Page 40:

Joint Feature Vector

Input: x2 Output: y2 ∈ {1,2,…,C}

Ψu(x2,2) = [0; Φ(x2); …; 0]

Page 41:

Joint Feature Vector

Input: x2 Output: y2 ∈ {1,2,…,C}

Ψu(x2,C) = [0; 0; …; Φ(x2)]

Page 42:

Overall Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

Page 43:

Score Function

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

Page 44:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

y* = argmax_y f(Ψu(x,y))

Page 45:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

y* = argmax_y wᵀΨu(x,y)

Page 46:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)

y* = argmax_y Σ_a (wa)ᵀΨu(xa,ya)

Maximize for each a ∈ {1,2,…,m} independently
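Since the unary score decomposes over pixels, prediction reduces to m independent maximizations. A sketch, assuming the per-pixel features are stacked in an (m, d) array and W holds one weight block per label (both names are mine):

import numpy as np

def predict_unary(W, phi):
    # W: (C, d) label weight blocks; phi: (m, d) per-pixel features Phi(x_a)
    scores = phi @ W.T                 # (m, C): score of every label at every pixel
    return scores.argmax(axis=1) + 1   # pick each y_a independently; labels 1,...,C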

Page 47:

Graphical Model: a grid of observed inputs x1, …, x9 with corresponding unobserved outputs y1, …, y9.

Segmentation

Page 48:

Unary Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

Page 49:

Pairwise Joint Feature Vector

[Grid graphical model with edges between neighbouring outputs y1, …, y9.]

Page 50:

Pairwise Joint Feature Vector

[Grid graphical model; the edge between y1 and y2 is highlighted.]

Ψp(x12,y12) = δ(y1 = y2)

Page 51:

Pairwise Joint Feature Vector

[Grid graphical model; the edge between y2 and y3 is highlighted.]

Ψp(x23,y23) = δ(y2 = y3)

Page 52:

Input: x Output: y ∈ {1,2,…,C}^m

Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]

Pairwise Joint Feature Vector

Page 53:

Overall Joint Feature Vector

Input: x Output: y ∈ {1,2,…,C}^m

Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]

Page 54:

Score Function

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

Page 55:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y f(Ψ(x,y))

Page 56:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y wᵀΨ(x,y)

Page 57:

Prediction

Input: x Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)

y* = argmax_y Σ_a (wa)ᵀΨu(xa,ya) + Σ_{a,b} (wab)ᵀΨp(xab,yab)

Week 5 “Optimization” lectures
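To make the decomposed score concrete, here is a sketch that evaluates (not maximizes) the unary + pairwise score of a given labeling; unary, w_pair, and edges are my own simplifications (a single Potts weight rather than per-edge weight blocks):

def score_labeling(unary, w_pair, y, edges):
    # unary[a][k]: precomputed unary score (w_k)^T Psi_u(x_a, k) of label k at pixel a
    # edges: list of neighbouring pixel pairs (a, b); y: labeling with y[a] in 0..C-1
    s = sum(unary[a][y[a]] for a in range(len(y)))
    s += w_pair * sum(y[a] == y[b] for (a, b) in edges)
    return s

Maximizing this score over all C^m labelings is the inference problem; brute force is exponential, hence the specialised algorithms of the “Optimization” lectures.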

Page 58:

Summary

Input x, outputs {y1, y2, …} → extract features Ψ(x,yi) → compute scores f(Ψ(x,yi)) → prediction y(f) = argmax_{yi} f(Ψ(x,yi))

How do I fix “f”?

Page 59:

• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning

• Structured Output SVM

• Optimization

• Results

Outline

Page 60:

Learning Objective

Data distribution P(x,y)

f* = argmin_f E_{P(x,y)} [Error(y(f), y)]

Error measures the quality of the prediction y(f) against the ground truth y; the expectation is over the data distribution, which is unknown.

Page 61:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f E_{P(x,y)} [Error(y(f), y)]

Error measures the quality of the prediction against the ground truth; the expectation is over the data distribution.

Page 62:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σ_i Error(yi(f), yi)

Expectation over the empirical distribution given by the finite samples.

Page 63:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σ_i Error(yi(f), yi) + λ R(f)

R(f) is a regularizer (to compensate for the finite samples); λ is its relative weight, a hyperparameter.

Page 64:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σ_i Error(yi(f), yi) + λ R(f)

Error can be the negative log-likelihood, which gives a probabilistic model.

Page 65:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004

Page 66:

Score Function and Prediction

Input: x Output: y

Joint feature vector of input and output: Ψ(x,y)

f(Ψ(x,y)) = wᵀΨ(x,y)

Prediction: max_y wᵀΨ(x,y)

Predicted Output: y(w) = argmax_y wᵀΨ(x,y)

Page 67:

Error Function

Δ(y,y(w)): user-specified loss or risk of the prediction given the ground truth.

Classification loss? Δ(y,y(w)) = δ(y ≠ y(w))

Example with ground truth “New York”: predicting “New York” costs 0, predicting “Paris” costs 1.

Page 68:

Error Function

Δ(y,y(w)): user-specified loss or risk of the prediction given the ground truth.

Detection loss? Based on the overlap score = (area of intersection) / (area of union).

Page 69:

Error Function

Δ(y,y(w)): user-specified loss or risk of the prediction given the ground truth.

Segmentation loss? Fraction of incorrect pixels, as a micro-average or a macro-average.

Page 70:

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w))

Loss function for i-th sample

Minimize the regularized sum of loss over training data

Highly non-convex in w

Regularization plays no role (overfitting may occur)

Learning Objective

Page 71:

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = wᵀΨ(xi,yi(w)) + Δ(yi,yi(w)) - wᵀΨ(xi,yi(w))

≤ wᵀΨ(xi,yi(w)) + Δ(yi,yi(w)) - wᵀΨ(xi,yi), since yi(w) maximizes the score: wᵀΨ(xi,yi(w)) ≥ wᵀΨ(xi,yi)

≤ max_y { wᵀΨ(xi,y) + Δ(yi,y) } - wᵀΨ(xi,yi)

This upper bound is convex in w and sensitive to the regularization of w.

Learning Objective

Page 72:

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i ξi

s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y

A quadratic program with a large number of constraints; many polynomial-time algorithms exist.
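Equivalently, eliminating ξi gives the unconstrained objective ||w||² + C Σ_i max_y {…}. A sketch that evaluates this objective by brute force over a finite label set (the names are mine; w and the joint features are numpy arrays, data is a list of (x, y_true) pairs):

def ssvm_objective(w, data, labels, joint_feature, loss, C):
    # ||w||^2 + C * sum_i max_y {w.Psi(x,y) + Delta(y_i,y) - w.Psi(x,y_i)}
    obj = float(w @ w)
    for x, y_true in data:
        obj += C * max(w @ joint_feature(x, y) + loss(y_true, y)
                       - w @ joint_feature(x, y_true) for y in labels)
    return obj

Note the inner max is never negative, since y = y_true contributes Δ(yi,yi) = 0.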

Page 73:

• Structured Output Prediction

• Structured Output SVM

• Optimization
  – Stochastic subgradient descent
  – Conditional gradient, aka Frank-Wolfe

• Results

Outline

Shalev-Shwartz et al. Mathematical Programming 2011

Page 74:

Gradient

Convex function g(z)

A gradient s at a point z0 satisfies g(z) - g(z0) ≥ sᵀ(z - z0)

Example: g(z) = z². Gradient? 2z0

Page 75:

Gradient Descent

min_z g(z), e.g. g(z) = z²

Start at some point z0 and move along the negative gradient direction:

z_{t+1} ← z_t - λ_t g'(z_t), with the step-size λ_t estimated via line search.

Page 76:

Gradient

Convex function g(z)

A gradient s at a point z0 satisfies g(z) - g(z0) ≥ sᵀ(z - z0)

May not exist: for g(z) = |z|, what is s at z0 = 0?

Page 77:

Subgradient

Convex function g(z)

A subgradient s at a point z0 satisfies g(z) - g(z0) ≥ sᵀ(z - z0)

May not be unique: for g(z) = |z| at z0 = 0, any s ∈ [-1,1] is a subgradient.

Page 78:

Subgradient Descent

min_z g(z), e.g. g(z) = |z|

Start at some point z0 and move along the negative subgradient direction:

z_{t+1} ← z_t - λ_t g'(z_t), with the step-size estimated via line search.

Doesn't always work.

Page 79:

Subgradient Descent

min_z max{z2 + 2z1, z2 - 2z1}

[Contour plot with level sets g(z) = 3, 4, 5, illustrating that a subgradient step chosen by line search can fail to decrease the objective.]


Page 81:

Subgradient Descent

min_z g(z), e.g. g(z) = |z|

Start at some point z0 and move along the negative subgradient direction:

z_{t+1} ← z_t - λ_t g'(z_t)

Convergence requires lim_{T→∞} Σ_{t=1}^T λ_t = ∞ and lim_{t→∞} λ_t = 0.

Page 82:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i ξi

s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y

Learning Objective

Constrained problem?

Page 83:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Learning Objective

Subgradient? Recall g(z) - g(z0) ≥ sᵀ(z - z0)

Page 84:

Subgradient

C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Ψ(xi,y) - Ψ(xi,yi)

Page 85:

Subgradient

C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Subgradient: Ψ(xi,ŷ) - Ψ(xi,yi), where ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Proof?

Page 86:

Subgradient

C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Subgradient: Ψ(xi,ŷ) - Ψ(xi,yi), where ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) } (the term wᵀΨ(xi,yi) is constant in y, so it drops out of the argmax). This is the inference problem.
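A sketch of this loss-augmented inference by brute force over a finite label set (the function names are mine; w and the joint features are numpy arrays):

def loss_augmented_inference(w, x, y_true, labels, joint_feature, loss):
    # y_hat = argmax_y w^T Psi(x,y) + Delta(y_true, y)
    return max(labels,
               key=lambda y: w @ joint_feature(x, y) + loss(y_true, y))

def subgradient_term(w, x, y_true, labels, joint_feature, loss):
    # per-sample subgradient Psi(x, y_hat) - Psi(x, y_true)
    y_hat = loss_augmented_inference(w, x, y_true, labels, joint_feature, loss)
    return joint_feature(x, y_hat) - joint_feature(x, y_true)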

Page 87:

Inference

Classification inference: ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }

Output: y ∈ {1,2,…,C}

Brute-force search

Page 88:

Inference

Detection inference: ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }

Output: y ∈ {Pixels}

Brute-force search

Page 89:

Inference

Segmentation inference: ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }

= argmax_y Σ_a (wa)ᵀΨu(xia,ya) + Σ_{a,b} (wab)ᵀΨp(xiab,yab) + Σ_a Δ(yia,ya)

Week 5 “Optimization” lectures

Page 90:

Subgradient Descent

Start at some parameter w0

For t = 0 to T            // Number of iterations
    s = 2 w_t
    For i = 1 to n        // Number of samples
        ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }
        s = s + C (Ψ(xi,ŷ) - Ψ(xi,yi))
    End
    w_{t+1} = w_t - λ_t s, with λ_t = 1/(t+1)
End
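A runnable sketch of this loop, under the same brute-force inference assumption as above (dim is the dimension of Ψ; all names are mine):

import numpy as np

def ssvm_subgradient_descent(data, labels, joint_feature, loss, C, T, dim):
    # minimizes ||w||^2 + C sum_i max_y {w.Psi(xi,y) + Delta(yi,y) - w.Psi(xi,yi)}
    w = np.zeros(dim)
    for t in range(T):
        s = 2.0 * w                                  # subgradient of ||w||^2
        for x, y_true in data:
            y_hat = max(labels, key=lambda y: w @ joint_feature(x, y)
                                              + loss(y_true, y))
            s += C * (joint_feature(x, y_hat) - joint_feature(x, y_true))
        w = w - s / (t + 1)                          # step size 1/(t+1)
    return w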


Page 92:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Learning Objective

Page 93:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C Σ_i max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Stochastic Approximation

Choose a sample ‘i’ with probability 1/n

Page 94:

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||² + C n max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }

Stochastic Approximation

Choose a sample ‘i’ with probability 1/n

Expected value? Original objective function

Page 95:

Stochastic Subgradient Descent

Start at some parameter w0

For t = 0 to T            // Number of iterations
    Choose a sample ‘i’ with probability 1/n
    s = 2 w_t
    ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }
    s = s + C n (Ψ(xi,ŷ) - Ψ(xi,yi))
    w_{t+1} = w_t - λ_t s, with λ_t = 1/(t+1)
End
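The stochastic variant in the same sketch style: one sampled example per iteration gives an unbiased estimate of the batch subgradient (names are mine):

import numpy as np

def ssvm_stochastic_subgradient(data, labels, joint_feature, loss, C, T, dim, seed=0):
    rng = np.random.default_rng(seed)
    n, w = len(data), np.zeros(dim)
    for t in range(T):
        x, y_true = data[rng.integers(n)]        # sample i with probability 1/n
        y_hat = max(labels, key=lambda y: w @ joint_feature(x, y)
                                          + loss(y_true, y))
        s = 2.0 * w + C * n * (joint_feature(x, y_hat)
                               - joint_feature(x, y_true))
        w = w - s / (t + 1)                      # decaying step size 1/(t+1)
    return w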

Page 96:

Convergence Rate

Compute an ε-optimal solution

C: SSVM hyperparameter

d: Number of non-zeros in the feature vector

O(dC/ε) iterations

Each iteration requires solving an inference problem

Page 97:

Side Note: Structured Output CNN

conv1 → conv2 → conv3 → conv4 → conv5 → fc6 → fc7 → SSVM

Back-propagate the subgradients

Page 98:

• Structured Output Prediction

• Structured Output SVM

• Optimization
  – Stochastic subgradient descent
  – Conditional gradient, aka Frank-Wolfe

• Results

Outline

Lacoste-Julien et al. ICML 2013

Page 99:

Slide courtesy Martin Jaggi

Conditional Gradient


Page 103:

SSVM Primal

min_w ||w||² + C Σ_i ξi

s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y

Derive dual on board

Page 104:

SSVM Dual

max_α -||Mα||²/4 + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

w = Mα/2, b = [Δ(yi,y)]

Page 105:

Linear Program

max_α -(Mα)ᵀw_t + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

Solve this over all possible α: Standard Frank-Wolfe

Solve this over all possible αi for a sample ‘i’: Block Coordinate Frank-Wolfe

Page 106:

Linear Program

max_α -(Mα)ᵀw_t + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

Vertices? αi(y) = C if y = ŷ, and 0 otherwise

Page 107:

Solution

max_α -(Mα)ᵀw_t + bᵀα

s.t. Σ_y αi(y) = C for all i

αi(y) ≥ 0 for all i, y

Which vertex maximizes the linear function?

si(y) = C if y = ŷ, and 0 otherwise, where ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) } (inference)

Page 108:

Update

α_{t+1} = (1-μ) α_t + μ s

Standard Frank-Wolfe: s contains the solution for all the samples

Block Coordinate Frank-Wolfe: s contains the solution for sample ‘i’, and sj = α_tj for all other samples

Page 109:

Step-Size

α_{t+1} = (1-μ) α_t + μ s

Along this line the dual objective is a quadratic function of the single variable μ, so the optimal step-size can be computed analytically.
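A sketch of that analytic step: along the Frank-Wolfe direction the dual objective has the form h(μ) = aμ² + bμ + c with a < 0, and the maximizer over [0,1] is the unconstrained optimum clipped to the interval (a and b here are the coefficients of that quadratic, not the dual's b vector):

def optimal_step_size(a, b):
    # maximize h(mu) = a*mu^2 + b*mu + c over mu in [0, 1], assuming a < 0
    mu = -b / (2.0 * a)
    return min(max(mu, 0.0), 1.0)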

Page 110:

Comparison

OCR Dataset

Page 111:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function

Outline

Page 112:

Optical Character Recognition

Identify each letter in a handwritten word

Taskar, Guestrin and Koller, NIPS 2003

Page 113:

Optical Character Recognition

Chain of letter variables X1, X2, X3, X4; Labels L = {a, b, …, z}

[Bar chart comparing Logistic Regression and Multi-Class SVM.]

Taskar, Guestrin and Koller, NIPS 2003

Page 114:

Optical Character Recognition

Chain of letter variables X1, X2, X3, X4; Labels L = {a, b, …, z}

[Bar chart comparing Maximum Likelihood and Structured Output SVM.]

Taskar, Guestrin and Koller, NIPS 2003

Page 115:

Optical Character Recognition

Taskar, Guestrin and Koller, NIPS 2003

Page 116:

Image Segmentation

Szummer, Kohli and Hoiem, ECCV 2008

Page 117:

Image Segmentation

Grid of pixel variables X1, …, X9; Labels L = {0, 1}

Szummer, Kohli and Hoiem, ECCV 2008

Page 118:

Image Segmentation

[Bar chart comparing Unary, Max Likelihood, and SSVM; y-axis from 0 to 25.]

Szummer, Kohli and Hoiem, ECCV 2008

Page 119:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function

Outline

Page 120:

Scene Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 9.6 to 11.4.]

Finley and Joachims, ICML 2008

Page 121:

Reuters Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 0 to 18.]

Finley and Joachims, ICML 2008

Page 122:

Yeast Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 0 to 50.]

Finley and Joachims, ICML 2008

Page 123:

Mediamill Dataset

[Bar chart comparing Greedy, LBP, Combine, Exact, and LP; y-axis from 0 to 40.]

Finley and Joachims, ICML 2008

Page 124:

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function

Outline

Page 125:

“Jumping” Classification

Page 126:

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …., n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Threshold the score for classification

Page 127:

“Jumping” Ranking

[Six images ranked 1 to 6.]

Average Precision = 1

Page 128:

Ranking vs. Classification

[Rankings of the six images compared: Average Precision = 1, 0.92, 0.81; Accuracy = 1, 0.67.]
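Average precision is the mean of precision@k over the ranks k that hold positive items, which is why it is sensitive to the ordering even when accuracy is not. A small worked sketch:

def average_precision(ranked_labels):
    # ranked_labels: 1 for a positive item, 0 for a negative, in ranked order
    hits, precisions = 0, []
    for k, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

print(average_precision([1, 1, 1, 0, 0, 0]))  # perfect ranking: 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))  # (1 + 1 + 3/4) / 3 ≈ 0.92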

Page 129:

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …., n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Sort the score for ranking

Page 130:

Computes subgradients of the AP loss

Page 131:

[Bar charts: Training Time, 0-1 loss vs. AP loss (AP is 5x slower); Average Precision, 0-1 loss vs. AP loss (4% improvement for free).]

Yue, Finley, Radlinski and Joachims, SIGIR 2007

Page 132:

Efficient Optimization of Average Precision

Pritish Mohapatra C. V. Jawahar M. Pawan Kumar

Page 133:

[Bar chart: Training Time for 0-1 loss, AP loss (5x slower), and efficient AP (slightly faster).]

Each iteration of AP optimization is slightly slower, but it takes fewer iterations to converge in practice.

Page 134:

Questions?