Loss-based Learning with Weak Supervision

M. Pawan Kumar

Page 1: Loss-based Learning with Weak Supervision M. Pawan Kumar

Loss-based Learning with Weak Supervision

M. Pawan Kumar

Page 2: Loss-based Learning with Weak Supervision M. Pawan Kumar

Computer Vision Data

[Plot: log(dataset size) vs. information in the annotation]

Segmentation: ~2000

Page 3: Loss-based Learning with Weak Supervision M. Pawan Kumar

Computer Vision Data

[Plot: log(dataset size) vs. information in the annotation]

Segmentation: ~2000

Bounding Box: ~1 M

Page 4: Loss-based Learning with Weak Supervision M. Pawan Kumar

Computer Vision Data

[Plot: log(dataset size) vs. information in the annotation]

Segmentation: ~2000

Bounding Box: ~1 M

Image-Level ("Car", "Chair"): > 14 M

Page 5: Loss-based Learning with Weak Supervision M. Pawan Kumar

Computer Vision Data

[Plot: log(dataset size) vs. information in the annotation]

Segmentation: ~2000

Bounding Box: ~1 M

Image-Level: > 14 M

Noisy Label: > 6 B

Page 6: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learn with missing information (latent variables)

Detailed annotation is expensive

Sometimes annotation is impossible

Desired annotation keeps changing

Computer Vision Data

Page 7: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Two Types of Problems

• Part I – Annotation Mismatch

• Part II – Output Mismatch

Outline

Page 8: Loss-based Learning with Weak Supervision M. Pawan Kumar

Annotation Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Classification

Mismatch between desired and available annotations

Exact value of latent variable is not “important”

Desired output during test time is y

Page 9: Loss-based Learning with Weak Supervision M. Pawan Kumar

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Classification

Page 10: Loss-based Learning with Weak Supervision M. Pawan Kumar

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Detection

Mismatch between output and available annotations

Exact value of latent variable is important

Desired output during test time is (y,h)

Page 11: Loss-based Learning with Weak Supervision M. Pawan Kumar

Part I

Page 12: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Page 13: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Data

Input x

Output y ∈ {-1,+1}

Hidden h

x

y = +1

h

Page 14: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

x

y = +1

h

Page 15: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,+1,h) = [ Φ(x,h) ; 0 ]

x

y = +1

h

Page 16: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,-1,h) = [ 0 ; Φ(x,h) ]

x

y = +1

h

Page 17: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Classification

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

Score f : Ψ(x,y,h) → (-∞, +∞)

Optimize score over all possible y and h

x

y = +1

h

Page 18: Loss-based Learning with Weak Supervision M. Pawan Kumar

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Latent SVM

Parameters
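To make the construction above concrete, here is a minimal Python sketch (not part of the tutorial code release) of the joint feature vector for the binary case and of the prediction rule y(w),h(w) = argmaxy,h wTΨ(x,y,h) by brute-force enumeration; phi_fn and latent_space are hypothetical placeholders for the feature map Φ and the set of latent values.

    import numpy as np

    def joint_feature(phi, y):
        """Psi(x, y, h) for binary y: place Phi(x, h) in the slot selected by y.
        Psi(x, +1, h) = [Phi; 0] and Psi(x, -1, h) = [0; Phi]."""
        d = phi.shape[0]
        psi = np.zeros(2 * d)
        if y == +1:
            psi[:d] = phi
        else:
            psi[d:] = phi
        return psi

    def predict(w, x, latent_space, phi_fn):
        """y(w), h(w) = argmax over (y, h) of w^T Psi(x, y, h)."""
        best_score, best_y, best_h = -np.inf, None, None
        for y in (+1, -1):
            for h in latent_space:
                score = w @ joint_feature(phi_fn(x, h), y)
                if score > best_score:
                    best_score, best_y, best_h = score, y, h
        return best_y, best_h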

Page 19: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

Empirical risk minimization

minw Σi Δ(yi, yi(w))

No restriction on the loss function

Annotation mismatch

Training data {(xi,yi), i = 1,2,…,n}

Page 20: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

Empirical risk minimization

minw Σi Δ(yi, yi(w))

Non-convex

Parameters cannot be regularized

Find a regularization-sensitive upper bound

Page 21: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

Δ(yi, yi(w)) = wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w)) - wTΨ(xi,yi(w),hi(w))

Page 22: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

Δ(yi, yi(w)) ≤ wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w)) - maxhi wTΨ(xi,yi,hi)

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Page 23: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

minw ||w||2 + C Σi ξi

s.t. maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } - maxhi wTΨ(xi,yi,hi) ≤ ξi

Parameters can be regularized

Is this also convex?

Page 24: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

minw ||w||2 + C Σi ξi

s.t. maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } - maxhi wTΨ(xi,yi,hi) ≤ ξi

Convex - Convex

Difference of convex (DC) program

Page 25: Loss-based Learning with Weak Supervision M. Pawan Kumar

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Learning

Recap
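As a small sketch of how the learning constraint is evaluated, the following Python computes the slack for one training example: loss-augmented inference over all (y, h) minus the best latent completion of the ground truth. The callables score and delta stand in for wTΨ(x,y,h) and Δ(y,y'); a real solver would use efficient inference rather than enumeration.

    def slack_upper_bound(score, delta, w, x, y_true, labels, latent_space):
        """Upper bound on Delta(y_i, y_i(w)) for one example (the latent SVM slack).
        score(w, x, y, h) = w^T Psi(x, y, h); delta(y, y') is the user-chosen loss."""
        # Convex part: loss-augmented inference over all labels and latent values.
        loss_augmented = max(score(w, x, y, h) + delta(y_true, y)
                             for y in labels for h in latent_space)
        # Concave part: best latent completion of the ground-truth annotation.
        best_truth = max(score(w, x, y_true, h) for h in latent_space)
        # Always >= 0, since y = y_true is included in the first maximisation.
        return loss_augmented - best_truth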

Page 26: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Page 27: Loss-based Learning with Weak Supervision M. Pawan Kumar

Learning Latent SVM

minw ||w||2 + C Σi ξi

s.t. maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } - maxhi wTΨ(xi,yi,hi) ≤ ξi

Difference of convex (DC) program

Page 28: Loss-based Learning with Weak Supervision M. Pawan Kumar

Concave-Convex Procedure

maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } + ( - maxhi wTΨ(xi,yi,hi) )

Linear upper-bound of concave part

Page 29: Loss-based Learning with Weak Supervision M. Pawan Kumar

Concave-Convex Procedure

maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } + ( - maxhi wTΨ(xi,yi,hi) )

Optimize the convex upper bound

Page 30: Loss-based Learning with Weak Supervision M. Pawan Kumar

Concave-Convex Procedure

maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } + ( - maxhi wTΨ(xi,yi,hi) )

Linear upper-bound of concave part

Page 31: Loss-based Learning with Weak Supervision M. Pawan Kumar

Concave-Convex Procedure

maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } + ( - maxhi wTΨ(xi,yi,hi) )

Until Convergence

Page 32: Loss-based Learning with Weak Supervision M. Pawan Kumar

Concave-Convex Procedure

maxy,h { wTΨ(xi,y,h) + Δ(yi, y) } + ( - maxhi wTΨ(xi,yi,hi) )

Linear upper bound?

Page 33: Loss-based Learning with Weak Supervision M. Pawan Kumar

Linear Upper Bound

- wTΨ(xi,yi,hi*)  ≥  - maxhi wTΨ(xi,yi,hi)

hi* = argmaxhi wtTΨ(xi,yi,hi)

Current estimate = wt

Page 34: Loss-based Learning with Weak Supervision M. Pawan Kumar

CCCP for Latent SVM

Start with an initial estimate w0

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Repeat until convergence
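A minimal Python sketch of this CCCP loop, assuming small enumerable label and latent spaces; plain subgradient descent stands in for the ε-optimal structural-SVM solver, and psi, delta, labels and latent_space are user-supplied placeholders.

    import numpy as np

    def cccp_latent_svm(train, labels, latent_space, psi, delta,
                        C=1.0, outer_iters=20, inner_iters=200, lr=1e-3):
        """CCCP for latent SVM (sketch). train is a list of (x, y) pairs;
        psi(x, y, h) returns a numpy feature vector; delta(y, y') is the loss."""
        x0, y0 = train[0]
        w = np.zeros_like(psi(x0, y0, latent_space[0]))
        for _ in range(outer_iters):
            # Step 1: impute the latent variables with the current parameters w_t.
            h_star = [max(latent_space, key=lambda h: w @ psi(x, y, h))
                      for (x, y) in train]
            # Step 2: approximately solve the convex upper bound
            # (subgradient descent replaces the epsilon-optimal solver).
            for _ in range(inner_iters):
                grad = 2.0 * w  # gradient of ||w||^2
                for (x, y), hs in zip(train, h_star):
                    y_hat, h_hat = max(
                        ((yy, hh) for yy in labels for hh in latent_space),
                        key=lambda p: w @ psi(x, p[0], p[1]) + delta(y, p[0]))
                    slack = (w @ psi(x, y_hat, h_hat) + delta(y, y_hat)
                             - w @ psi(x, y, hs))
                    if slack > 0:  # hinge constraint is active
                        grad += C * (psi(x, y_hat, h_hat) - psi(x, y, hs))
                w -= lr * grad
        return w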

Page 35: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice

• Extensions

Outline – Annotation Mismatch

Page 36: Loss-based Learning with Weak Supervision M. Pawan Kumar

Action Classification

Input x

Output y = “Using Computer”

PASCAL VOC 2011

80/20 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train Input xi Output yi

Page 37: Loss-based Learning with Weak Supervision M. Pawan Kumar

• 0-1 loss function

• Poselet-based feature vector

• 4 seeds for random initialization

• Code + Data

• Train/Test scripts with hyperparameter settings

Setup

http://www.centrale-ponts.fr/tutorials/cvpr2013/

Page 38: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 39: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 40: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 41: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 42: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Page 43: Loss-based Learning with Weak Supervision M. Pawan Kumar

Start with an initial estimate w0

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Repeat until convergence

Overfitting in initial iterations

Page 44: Loss-based Learning with Weak Supervision M. Pawan Kumar

Start with an initial estimate w0 and a loose tolerance ε'

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε'-optimal solution of

min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Anneal the tolerance: ε' ← ε'/K

Repeat until convergence and ε' = ε
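A compact sketch of the annealed schedule above; impute, solve_inner and has_converged are placeholders for the latent-variable imputation step, the ε'-optimal convex solver, and the CCCP convergence test, none of which are spelled out on the slide.

    def cccp_annealed_tolerance(train, w0, impute, solve_inner, has_converged,
                                eps_target=0.01, K=2.0, eps_start=1.0):
        """Anneal the inner tolerance: start loose, divide by K each outer
        iteration, and stop only once the target tolerance eps is reached."""
        w, eps = w0, eps_start
        while True:
            h_star = impute(w, train)                # h_i* = argmax_h w_t^T Psi(x_i, y_i, h)
            w = solve_inner(train, h_star, tol=eps)  # eps'-optimal solution of the convex bound
            if eps <= eps_target and has_converged(w):
                break
            eps = max(eps / K, eps_target)           # eps' <- eps'/K
        return w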

Page 45: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 46: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 47: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 48: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 49: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 50: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 51: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 52: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 53: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Page 54: Loss-based Learning with Weak Supervision M. Pawan Kumar

Start with an initial estimate w0

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Repeat until convergence

Overfitting in initial iterations

Page 55: Loss-based Learning with Weak Supervision M. Pawan Kumar

Start with an initial estimate w0 and a small regularization weight C'

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C' Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Anneal the regularization: C' ← C' × K

Repeat until convergence and C' = C

Page 56: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 57: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 58: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 59: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 60: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 61: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 62: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 63: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 64: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Kumar, Packer and Koller, NIPS 2010

Page 65: Loss-based Learning with Weak Supervision M. Pawan Kumar

1 + 1 = 2

1/3 + 1/6 = 1/2

e^(iπ) + 1 = 0

Math is for losers !!

FAILURE … BAD LOCAL MINIMUM

CCCP for Human Learning

Page 66: Loss-based Learning with Weak Supervision M. Pawan Kumar

Euler was a Genius!!

SUCCESS … GOOD LOCAL MINIMUM

1 + 1 = 2

1/3 + 1/6 = 1/2

e^(iπ) + 1 = 0

Self-Paced Learning

Page 67: Loss-based Learning with Weak Supervision M. Pawan Kumar

Start with “easy” examples, then consider “hard” ones

Easy vs. Hard: expensive to determine manually

Easy for human ≠ Easy for machine

Self-Paced Learning

Simultaneously estimate easiness and parameters. Easiness is a property of data sets, not of single instances.

Page 68: Loss-based Learning with Weak Supervision M. Pawan Kumar

CCCP for Latent SVM

Start with an initial estimate w0

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Page 69: Loss-based Learning with Weak Supervision M. Pawan Kumar

min ||w||2 + C Σi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y, h) - ξi

Self-Paced Learning

Page 70: Loss-based Learning with Weak Supervision M. Pawan Kumar

min ||w||2 + C Σi vi ξi
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y, h) - ξi

vi ∈ {0,1}

Trivial Solution

Self-Paced Learning

Page 71: Loss-based Learning with Weak Supervision M. Pawan Kumar

vi ∈ {0,1}

Large K → Medium K → Small K

min ||w||2 + C Σi vi ξi - Σi vi/K
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y, h) - ξi

Self-Paced Learning

Page 72: Loss-based Learning with Weak Supervision M. Pawan Kumar

vi ∈ [0,1]

min ||w||2 + C Σi vi ξi - Σi vi/K
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y, h) - ξi

Large K → Medium K → Small K

Biconvex Problem

Alternating Convex Search

Self-Paced Learning

Page 73: Loss-based Learning with Weak Supervision M. Pawan Kumar

SPL for Latent SVM

Start with an initial estimate w0

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C Σi vi ξi - Σi vi/K
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Decrease K ← K/μ, for a fixed factor μ > 1
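A sketch of the alternating convex search: with w fixed, the optimal vi has a closed form (include example i only when C·ξi < 1/K), and with v fixed the weighted problem is solved as before. compute_slacks and solve_weighted are placeholders, and the annealing factor mu is an assumed choice.

    def spl_select(slacks, C, K):
        """Closed-form v_i update for fixed w: v_i = 1 iff C * xi_i < 1/K."""
        return [1.0 if C * s < 1.0 / K else 0.0 for s in slacks]

    def self_paced_learning(train, w0, compute_slacks, solve_weighted,
                            C=1.0, K=16.0, mu=1.3, K_min=1e-3):
        """Self-paced learning sketch (alternating convex search).
        compute_slacks(w, train) returns the per-example slacks xi_i;
        solve_weighted(train, v, w) solves the weighted CCCP inner problem."""
        w = w0
        while K > K_min:
            v = spl_select(compute_slacks(w, train), C, K)  # fix w, optimise v
            w = solve_weighted(train, v, w)                 # fix v, optimise w
            K /= mu                                         # anneal K to admit harder examples
        return w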

Page 74: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 75: Loss-based Learning with Weak Supervision M. Pawan Kumar

Objective

Page 76: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 77: Loss-based Learning with Weak Supervision M. Pawan Kumar

Train Error

Page 78: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 79: Loss-based Learning with Weak Supervision M. Pawan Kumar

Test Error

Page 80: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 81: Loss-based Learning with Weak Supervision M. Pawan Kumar

Time

Page 82: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function

• Extensions

Outline – Annotation Mismatch

Behl, Jawahar and Kumar, In Preparation

Page 83: Loss-based Learning with Weak Supervision M. Pawan Kumar

Ranking

[Six retrieved images shown in rank order 1–6; all relevant images ranked first]

Average Precision = 1

Page 84: Loss-based Learning with Weak Supervision M. Pawan Kumar

Ranking

[Six retrieved images shown in rank order 1–6, for three different rankings]

Average Precision = 1, Accuracy = 1
Average Precision = 0.92, Accuracy = 0.67
Average Precision = 0.81

Page 85: Loss-based Learning with Weak Supervision M. Pawan Kumar

Ranking

During testing, AP is frequently used

During training, a surrogate loss is used

Contradictory to loss-based learning

Optimize AP directly
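For reference, average precision is the mean of precision@k taken over the ranks k at which the positives appear; a short Python sketch, with the second ranking mirroring the 0.92 example above.

    import numpy as np

    def average_precision(labels_in_ranked_order):
        """AP of a ranking given its relevance labels (1 = positive, 0 = negative)."""
        labels = np.asarray(labels_in_ranked_order, dtype=float)
        cum_pos = np.cumsum(labels)
        precisions = cum_pos / np.arange(1, len(labels) + 1)
        return precisions[labels == 1].mean()

    # Six items, three positives: a perfect ranking vs. one with a single swap.
    print(average_precision([1, 1, 1, 0, 0, 0]))  # 1.0
    print(average_precision([1, 1, 0, 1, 0, 0]))  # ~0.92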

Page 86: Loss-based Learning with Weak Supervision M. Pawan Kumar

Results

Statistically significant improvement

Page 87: Loss-based Learning with Weak Supervision M. Pawan Kumar

Speed – Proximal Regularization

Start with a good initial estimate w0

Update hi* = argmaxhi∈H wtTΨ(xi,yi,hi)

Update wt+1 as the ε-optimal solution of

min ||w||2 + C Σi ξi + Ct ||w - wt||2
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi

Repeat until convergence

Page 88: Loss-based Learning with Weak Supervision M. Pawan Kumar

Speed – Cascades

Weiss and Taskar, AISTATS 2010; Sapp, Toshev and Taskar, ECCV 2010

Page 89: Loss-based Learning with Weak Supervision M. Pawan Kumar

Accuracy – (Self) Pacing

Pacing the sample complexity – NIPS 2010

Pacing the model complexity

Pacing the problem complexity

Page 90: Loss-based Learning with Weak Supervision M. Pawan Kumar

Building Accurate Systems

[Chart: relative contribution of Model, Inference, and Learning — roughly 85%, 5%, and 10%]

Learning cannot provide huge gains without a good model

Inference cannot provide huge gains without a good model

Page 91: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice

• Extensions
  – Latent Variable Dependent Loss
  – Max-Margin Min-Entropy Models

Outline – Annotation Mismatch

Yu and Joachims, ICML 2009

Page 92: Loss-based Learning with Weak Supervision M. Pawan Kumar

Latent Variable Dependent Loss

Δ(yi, yi(w), hi(w)) = wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w), hi(w)) - wTΨ(xi,yi(w),hi(w))

Page 93: Loss-based Learning with Weak Supervision M. Pawan Kumar

Latent Variable Dependent Loss

Δ(yi, yi(w), hi(w)) ≤ wTΨ(xi,yi(w),hi(w)) + Δ(yi, yi(w), hi(w)) - maxhi wTΨ(xi,yi,hi)

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Page 94: Loss-based Learning with Weak Supervision M. Pawan Kumar

Latent Variable Dependent Loss

minw ||w||2 + C Σi ξi

s.t. maxy,h { wTΨ(xi,y,h) + Δ(yi, y, h) } - maxhi wTΨ(xi,yi,hi) ≤ ξi

Page 95: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimizing Precision@k

Input X = {xi, i = 1, …, n}

Annotation Y = {yi, i = 1, …, n} ∈ {-1,+1}^n

Latent H = ranking

Δ(Y*, Y, H) = 1 - Precision@k
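A small sketch of this latent-variable dependent loss, 1 - Precision@k, evaluated from the annotation and a predicted ranking; the items and labels below are made up for illustration.

    def precision_at_k_loss(y_true, ranking, k):
        """Delta(Y*, Y, H) = 1 - Precision@k: fraction of the top-k ranked items
        that are not annotated as positive. y_true maps item index -> {-1, +1};
        ranking lists item indices from best to worst."""
        hits = sum(1 for i in ranking[:k] if y_true[i] == +1)
        return 1.0 - hits / float(k)

    # Items 0..5 with positives {0, 2, 4} and a hypothetical predicted ranking.
    y = {0: +1, 1: -1, 2: +1, 3: -1, 4: +1, 5: -1}
    print(precision_at_k_loss(y, [0, 2, 1, 4, 3, 5], k=3))  # 1 - 2/3 ≈ 0.33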

Page 96: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Latent SVM

• Optimization

• Practice

• Extensions
  – Latent Variable Dependent Loss
  – Max-Margin Min-Entropy (M3E) Models

Outline – Annotation Mismatch

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

Page 97: Loss-based Learning with Weak Supervision M. Pawan Kumar

Running vs. Jumping Classification

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h) over latent locations h:
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Page 98: Loss-based Learning with Weak Supervision M. Pawan Kumar

wTΨ(x,y1,h) over latent locations h:
0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y2,h) over latent locations h:
0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

Score wTΨ(x,y,h) ∈ (-∞, +∞)

Only maximum score used

No other useful cue?

Uncertainty in h

Running vs. Jumping Classification

Page 99: Loss-based Learning with Weak Supervision M. Pawan Kumar

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy Hα(Pw(h|y,x)) – log Pw(y|x)

Partition Function

Marginalized Probability

Rényi Entropy

Rényi Entropy of Generalized Distribution Gα(y;x,w)

M3E
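A minimal sketch of this prediction rule, assuming the joint scores wTΨ(x,y,h) are precomputed in a dictionary keyed by (y, h). It normalises them into Pw(y,h|x), evaluates the Rényi entropy of the generalized distribution for each label, and returns the label with the smallest value (α ≠ 1; the Shannon case needs the limit).

    import numpy as np

    def renyi_entropy_generalized(scores_yh, y, alpha):
        """G_alpha(y; x, w) for P_w(y,h|x) proportional to exp(w^T Psi(x,y,h)).
        scores_yh maps (y, h) -> w^T Psi(x, y, h); requires alpha != 1."""
        all_scores = np.array(list(scores_yh.values()))
        log_Z = np.log(np.exp(all_scores - all_scores.max()).sum()) + all_scores.max()
        p_yh = np.array([np.exp(s - log_Z)
                         for (label, _), s in scores_yh.items() if label == y])
        return np.log((p_yh ** alpha).sum() / p_yh.sum()) / (1.0 - alpha)

    def m3e_predict(scores_yh, labels, alpha):
        """y(w) = argmin_y G_alpha(y; x, w); alpha -> infinity recovers latent SVM."""
        return min(labels, key=lambda y: renyi_entropy_generalized(scores_yh, y, alpha))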

Page 100: Loss-based Learning with Weak Supervision M. Pawan Kumar

Gα(y;x,w) = 1/(1-α) · log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1: Shannon Entropy of the Generalized Distribution

- Σh Pw(y,h|x) log Pw(y,h|x) / Σh Pw(y,h|x)

Rényi Entropy

Page 101: Loss-based Learning with Weak Supervision M. Pawan Kumar

Gα(y;x,w) = 1/(1-α) · log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α → ∞: Minimum Entropy of the Generalized Distribution

- maxh log Pw(y,h|x)

Rényi Entropy

Page 102: Loss-based Learning with Weak Supervision M. Pawan Kumar

Gα(y;x,w) = 1/(1-α) · log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α → ∞: Minimum Entropy of the Generalized Distribution

- maxh wTΨ(x,y,h)

Same prediction as latent SVM

Rényi Entropy

Page 103: Loss-based Learning with Weak Supervision M. Pawan Kumar

Training data {(xi,yi), i = 1,2,…,n}

Highly non-convex in w

Cannot regularize w to prevent overfitting

w* = argminw Σi Δ(yi,yi(w))

Learning M3E

Page 104: Loss-based Learning with Weak Supervision M. Pawan Kumar

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi, yi(w)) = Gα(yi(w);xi,w) + Δ(yi, yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + Δ(yi, yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi, y) - Gα(y;xi,w) }

Learning M3E

Page 105: Loss-based Learning with Weak Supervision M. Pawan Kumar

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) – Gα(y;xi,w) ≤ ξi

When α tends to infinity, M3E = Latent SVM

Other values can give better results

Learning M3E

Page 106: Loss-based Learning with Weak Supervision M. Pawan Kumar

Motif + Markov Background Model. Yu and Joachims, 2009

Motif Finding Results

Page 107: Loss-based Learning with Weak Supervision M. Pawan Kumar

Part II

Page 108: Loss-based Learning with Weak Supervision M. Pawan Kumar

Output Mismatch

Input x

Annotation y

Latent h

x

y = “jumping”

h

Action Detection

Mismatch between output and available annotations

Exact value of latent variable is important

Desired output during test time is (y,h)

Page 109: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 110: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Data

Input x

Output y ∈ {0,1,…,C}

Hidden h

x

y = 0

h

Page 111: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

x

y = 0

h

Page 112: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,0,h) = [ Φ(x,h) ; 0 ; … ; 0 ]

x

y = 0

h

Page 113: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,1,h) = [ 0 ; Φ(x,h) ; … ; 0 ]

x

y = 0

h

Page 114: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,C,h) = [ 0 ; 0 ; … ; Φ(x,h) ]

x

y = 0

h

Page 115: Loss-based Learning with Weak Supervision M. Pawan Kumar

Weakly Supervised Detection

Feature Φ(x,h)

Joint Feature Vector

Ψ(x,y,h)

Score f : Ψ(x,y,h) → (-∞, +∞)

Optimize score over all possible y and h

x

y = 0

h

Page 116: Loss-based Learning with Weak Supervision M. Pawan Kumar

Scoring function

wTΨ(x,y,h)

Prediction

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Linear Model

Parameters

Page 117: Loss-based Learning with Weak Supervision M. Pawan Kumar

Minimizing General Loss

minw Σi Δ(yi,hi,yi(w),hi(w))   (supervised samples)
    + Σi Δ'(yi,yi(w),hi(w))   (weakly supervised samples)

Unknown latent variable values

Page 118: Loss-based Learning with Weak Supervision M. Pawan Kumar

Minimizing General Loss

minw Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Page 119: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Kumar, Packer and Koller, ICML 2012

Page 120: Loss-based Learning with Weak Supervision M. Pawan Kumar

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Page 121: Loss-based Learning with Weak Supervision M. Pawan Kumar

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Page 122: Loss-based Learning with Weak Supervision M. Pawan Kumar

Solution

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Pθ(hi|yi,xi)

hi

Page 123: Loss-based Learning with Weak Supervision M. Pawan Kumar

SolutionUse two different distributions for the two different tasks

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

Page 124: Loss-based Learning with Weak Supervision M. Pawan Kumar

The Ideal Case

No latent variable uncertainty, correct prediction

hi

Pw(yi,hi|xi)

(yi,hi)(yi,hi(w))

Pθ(hi|yi,xi)

hi(w)

Page 125: Loss-based Learning with Weak Supervision M. Pawan Kumar

In Practice

Restrictions in the representation power of models

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

Page 126: Loss-based Learning with Weak Supervision M. Pawan Kumar

Our Framework

Minimize the dissimilarity between the two distributions

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

User-defined dissimilarity measure

Page 127: Loss-based Learning with Weak Supervision M. Pawan Kumar

Our Framework

Minimize Rao's Dissimilarity Coefficient

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 128: Loss-based Learning with Weak Supervision M. Pawan Kumar

Our Framework

Minimize Rao's Dissimilarity Coefficient

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

- β Σh,h’ Δ(yi,h,yi,h’)Pθ(h|yi,xi)Pθ(h’|yi,xi)

Hi(w,θ)

Page 129: Loss-based Learning with Weak Supervision M. Pawan Kumar

Our Framework

Minimize Rao's Dissimilarity Coefficient

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

Page 130: Loss-based Learning with Weak Supervision M. Pawan Kumar

Our Framework

Minimize Rao's Dissimilarity Coefficient

hi

Pw(yi,hi|xi)

(yi,hi)(yi(w),hi(w))

Pθ(hi|yi,xi)

minw,θ Σi [ Hi(w,θ) - β Hi(θ,θ) ]
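A sketch of how this objective could be evaluated for a set of weakly supervised examples; predict_w and P_theta_fn are placeholders for the prediction of the w-model and the conditional distribution of the θ-model, and the (1-β) self-dissimilarity term of the point-mass prediction is zero and therefore omitted.

    def diversity_term(P_theta, delta, y, latent_space):
        """H_i(theta, theta) = sum over h, h' of Delta(y,h,y,h') P_theta(h) P_theta(h')."""
        return sum(P_theta[h] * P_theta[hp] * delta(y, h, y, hp)
                   for h in latent_space for hp in latent_space)

    def cross_term(P_theta, delta, y, y_pred, h_pred, latent_space):
        """H_i(w, theta) = sum over h of Delta(y,h, y(w),h(w)) P_theta(h):
        expected loss of the w-model prediction under the theta-model uncertainty."""
        return sum(P_theta[h] * delta(y, h, y_pred, h_pred) for h in latent_space)

    def dissimilarity_objective(samples, predict_w, P_theta_fn, delta, latent_space, beta=0.5):
        """Sum over samples of H_i(w, theta) - beta * H_i(theta, theta)."""
        total = 0.0
        for (x, y) in samples:
            y_pred, h_pred = predict_w(x)   # (y_i(w), h_i(w))
            P = P_theta_fn(x, y)            # dict: h -> P_theta(h | y_i, x_i)
            total += cross_term(P, delta, y, y_pred, h_pred, latent_space)
            total -= beta * diversity_term(P, delta, y, latent_space)
        return total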

Page 131: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 132: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence

End

Fix w and optimize θ

Fix θ and optimize w

Page 133: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

hi

Pθ(hi|yi,xi)

Case I: yi(w) = yi

hi(w)

Page 134: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

hi

Pθ(hi|yi,xi)

Case I: yi(w) = yi

hi(w)

Page 135: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

hi

Pθ(hi|yi,xi)

Case II: yi(w) ≠ yi

Page 136: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi) - β Hi(θ,θ)

hi

Pθ(hi|yi,xi)

Case II: yi(w) ≠ yi

Stochastic subgradient descent

Page 137: Loss-based Learning with Weak Supervision M. Pawan Kumar

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w))Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

Δ independent of h, implies latent SVM

Concave-Convex Procedure (CCCP)

Page 138: Loss-based Learning with Weak Supervision M. Pawan Kumar

• Problem Formulation

• Dissimilarity Coefficient Learning

• Optimization

• Experiments

Outline – Output Mismatch

Page 139: Loss-based Learning with Weak Supervision M. Pawan Kumar

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train Input xi Output yi

Page 140: Loss-based Learning with Weak Supervision M. Pawan Kumar

Results – 0/1 Loss

[Bar chart: average test loss per fold (Fold 1–5), LSVM vs. Our method]

Statistically Significant

Page 141: Loss-based Learning with Weak Supervision M. Pawan Kumar

Results – Overlap Loss

[Bar chart: average test loss per fold (Fold 1–5), LSVM vs. Our method]

Statistically Significant

Page 142: Loss-based Learning with Weak Supervision M. Pawan Kumar

Questions?

http://www.centrale-ponts.fr/personnel/pawan