
Learning from Big Data Lecture 5

M. Pawan Kumar

http://www.robots.ox.ac.uk/~oval/

Slides available online http://mpawankumar.info

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Is this an urban or rural area?

Input: x   Output: y ∈ {-1,+1}

Image Classification

Is this scan healthy or unhealthy?

Input: x   Output: y ∈ {-1,+1}

Image Classification

Observed input: x   Unobserved output: y

Label -1 or Label +1

Probabilistic Graphical Model

Image Classification

Feature Vector

x passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x)

Joint Feature Vector

Input: x   Output: y ∈ {-1,+1}

Ψ(x,y)

Ψ(x,-1) = [Φ(x); 0]

Ψ(x,+1) = [0; Φ(x)]

Score Function

Input: x   Output: y ∈ {-1,+1}

f: Ψ(x,y) → (-∞,+∞)   f(Ψ(x,y)) = wTΨ(x,y)

Prediction

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs
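A minimal sketch (plain Python/NumPy, with a hypothetical feature dimension and hypothetical parameters) of the binary joint feature vector and the argmax prediction rule described above:

```python
import numpy as np

def joint_feature(phi_x, y):
    """Psi(x, y) for y in {-1, +1}: Phi(x) is placed in the block selected by y."""
    d = phi_x.shape[0]
    psi = np.zeros(2 * d)
    if y == -1:
        psi[:d] = phi_x      # Psi(x, -1) = [Phi(x); 0]
    else:
        psi[d:] = phi_x      # Psi(x, +1) = [0; Phi(x)]
    return psi

def predict(w, phi_x):
    """y* = argmax_y w^T Psi(x, y), maximised over the two possible outputs."""
    return max((-1, +1), key=lambda y: w @ joint_feature(phi_x, y))

# Toy usage with a hypothetical 4-dimensional feature Phi(x).
phi_x = np.array([0.2, -1.0, 0.5, 0.3])
w = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # hypothetical parameters
print(predict(w, phi_x))  # prints 1: the +1 block scores 0.2, the -1 block scores 0.0
```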

• Structured Output Prediction– Binary Output– Multi-label Output– Structured Output– Learning

• Structured Output SVM

• Optimization

• Results

Outline

Which city is this?

Input: x   Output: y ∈ {1,2,…,C}

Image Classification

What type of tumor does this scan contain?

Input: x   Output: y ∈ {1,2,…,C}

Image Classification

Observed input: x   Unobserved output: y ∈ {1,2,…,C}

Graphical Model

Image Classification

Feature Vector

x passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x)

Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}

Ψ(x,y)

Ψ(x,1) = [Φ(x); 0; …; 0]

Ψ(x,2) = [0; Φ(x); …; 0]

Ψ(x,C) = [0; …; 0; Φ(x)]
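The same block construction generalises to C classes. A small sketch (dimensions hypothetical) where Ψ(x,y) places Φ(x) in the y-th of C blocks, so that wTΨ(x,y) reduces to a per-class dot product:

```python
import numpy as np

def joint_feature_multiclass(phi_x, y, num_classes):
    """Psi(x, y) for y in {1, ..., C}: Phi(x) in the y-th block, zeros elsewhere."""
    d = phi_x.shape[0]
    psi = np.zeros(num_classes * d)
    psi[(y - 1) * d : y * d] = phi_x
    return psi

# With w partitioned into C blocks w = [w_1; ...; w_C],
# w^T Psi(x, y) equals w_y^T Phi(x), i.e. a standard multi-class linear score.
```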

Where is the object in the image?

Input: x   Output: y ∈ {Pixels}

Object Detection

Where is the rupture in the scan?

Input: x   Output: y ∈ {Pixels}

Object Detection

Observed input: x   Unobserved output: y ∈ {Pixels}

Graphical Model

Object Detection

Joint Feature Vector

x and y passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Ψ(x,y)

Score Function

Input: x   Output: y ∈ {1,2,…,C}

f: Ψ(x,y) → (-∞,+∞)   f(Ψ(x,y)) = wTΨ(x,y)

Prediction

y* = argmax_y f(Ψ(x,y))

Maximize the score over all possible outputs

• Structured Output Prediction– Binary Output– Multi-label Output– Structured Output– Learning

• Structured Output SVM

• Optimization

• Results

Outline

What is the semantic class of each pixel?

Input: x   Output: y ∈ {1,2,…,C}^m

(Example labels: car, road, grass, tree, sky)

Segmentation

What is the muscle group of each pixel?

Input: x   Output: y ∈ {1,2,…,C}^m

Segmentation

(Grid graphical model: observed inputs x1,…,x9 with unobserved outputs y1,…,y9, one per pixel)

Graphical Model

Segmentation

Feature Vector

x1 passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x1)

Joint Feature Vector

Input: x1   Output: y1 ∈ {1,2,…,C}

Ψu(x1,1) = [Φ(x1); 0; …; 0]

Ψu(x1,2) = [0; Φ(x1); …; 0]

Ψu(x1,C) = [0; …; 0; Φ(x1)]

Feature Vector

x2 passed through a pre-trained CNN: conv1 conv2 conv3 conv4 conv5 fc6 fc7

Feature Φ(x2)

Joint Feature Vector

Input: x2   Output: y2 ∈ {1,2,…,C}

Ψu(x2,1) = [Φ(x2); 0; …; 0]

Ψu(x2,2) = [0; Φ(x2); …; 0]

Ψu(x2,C) = [0; …; 0; Φ(x2)]

Overall Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

Score Function

Input: x   Output: y ∈ {1,2,…,C}^m

f: Ψu(x,y) → (-∞,+∞)   f(Ψu(x,y)) = wTΨu(x,y)

Prediction

y* = argmax_y f(Ψu(x,y))

y* = argmax_y wTΨu(x,y)

y* = argmax_y Σa (wa)TΨu(xa,ya)

Maximize for each a ∈ {1,2,…,m} independently
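Because the unary-only score decomposes as Σa (wa)TΨu(xa,ya), the maximisation splits into one small problem per pixel. A sketch of this independent per-pixel argmax (array shapes are hypothetical):

```python
import numpy as np

def predict_unary(block_scores):
    """
    block_scores[a, c] = (w_a)^T Psi_u(x_a, c) for pixel a and label c.
    Since the total score is a sum over pixels, each pixel's label is chosen
    independently by an argmax over its C candidate labels.
    Returns labels in {1, ..., C}, shape (m,).
    """
    return block_scores.argmax(axis=1) + 1

# Example: m = 3 pixels, C = 2 labels.
scores = np.array([[0.1, 0.9],
                   [2.0, -1.0],
                   [0.3, 0.4]])
print(predict_unary(scores))  # [2 1 2]
```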

(Grid graphical model: observed inputs x1,…,x9 with unobserved outputs y1,…,y9)

Graphical Model

Segmentation

Unary Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]

(Grid graphical model with pairwise edges between neighbouring pixels)

Pairwise Joint Feature Vector

(Edge between pixels 1 and 2)

Ψp(x12,y12) = δ(y1 = y2)

Pairwise Joint Feature Vector

(Edge between pixels 2 and 3)

Ψp(x23,y23) = δ(y2 = y3)

Pairwise Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]

Pairwise Joint Feature Vector

Overall Joint Feature Vector

Input: x   Output: y ∈ {1,2,…,C}^m

Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]

Score Function

Input: x   Output: y ∈ {1,2,…,C}^m

f: Ψ(x,y) → (-∞,+∞)   f(Ψ(x,y)) = wTΨ(x,y)

Prediction

y* = argmax_y f(Ψ(x,y))

y* = argmax_y wTΨ(x,y)

y* = argmax_y Σa (wa)TΨu(xa,ya) + Σa,b (wab)TΨp(xab,yab)

Week 5 “Optimization” lectures
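As a hedged sketch of how the overall score combines unary and pairwise terms (using the Potts-style pairwise feature δ(ya = yb) from the slides, a single shared pairwise weight, and a hypothetical edge list):

```python
import numpy as np

def overall_score(y, unary_scores, w_pair, edges):
    """
    Score of a labelling y: sum of unary terms plus pairwise terms.
    y:            labelling of the m pixels, shape (m,), labels in {0, ..., C-1}
    unary_scores: shape (m, C), unary_scores[a, c] = (w_a)^T Psi_u(x_a, c)
    w_pair:       weight multiplying the pairwise feature delta(y_a = y_b)
    edges:        list of (a, b) pairs, e.g. the 4-neighbour grid edges
    """
    score = unary_scores[np.arange(len(y)), y].sum()
    score += w_pair * sum(y[a] == y[b] for a, b in edges)
    return score
```

Unlike the unary-only case, maximising this score couples the pixels through the edges, which is exactly the inference problem deferred to the Week 5 optimization lectures.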

Input x, Outputs {y1,y2,…}

Extract Features: Ψ(x,yi)

Compute Scores: f(Ψ(x,yi))

Prediction: y(f) = argmax_yi f(Ψ(x,yi))

How do I fix “f”?

Summary

• Structured Output Prediction– Binary Output– Multi-label Output– Structured Output– Learning

• Structured Output SVM

• Optimization

• Results

Outline

Data distribution P(x,y)

f* = argmin_f E_P(x,y) [ Error(y(f), y) ]

Error: measure of prediction quality, comparing the prediction y(f) to the ground truth y

Expectation over the data distribution

Distribution is unknown

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f E_P(x,y) [ Error(y(f), y) ]

Error: measure of prediction quality (prediction y(f) vs. ground truth y)

Expectation over the data distribution

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σi Error(yi(f), yi)

Expectation over the empirical distribution (finite samples)

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σi Error(yi(f), yi) + λ R(f)

R(f): regularizer   λ: relative weight (hyperparameter)

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

f* = argmin_f Σi Error(yi(f), yi) + λ R(f)

Error can be the negative log-likelihood (probabilistic model)

Learning Objective

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results

Outline

Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004

Score Function and Prediction

Input: x Output: y

Joint feature vector of input and output: Ψ(x,y)

f(Ψ(x,y)) = wTΨ(x,y)

Prediction: max_y wTΨ(x,y)

Predicted Output: y(w) = argmax_y wTΨ(x,y)

Δ(y,y(w))

Loss or risk of prediction given ground-truth

Error Function

Classification loss? User specified

Example: “New York” → 0, “Paris” → 1

Δ(y,y(w)) = δ(y ≠ y(w))

Δ(y,y(w))

Loss or risk of prediction given ground-truth

Error Function

Detection loss? User specified

Overlap score = Area of intersection / Area of union

Δ(y,y(w))

Loss or risk of prediction given ground-truth

Error Function

Segmentation loss? User specified

Fraction of incorrect pixels (micro-average or macro-average)

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w))

Loss function for i-th sample

Minimize the regularized sum of losses over the training data

Highly non-convex in w

Regularization plays no role (overfitting may occur)

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) = wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi(w))

≤ wTΨ(xi,yi(w)) + Δ(yi,yi(w)) - wTΨ(xi,yi)   (since wTΨ(xi,yi(w)) ≥ wTΨ(xi,yi))

≤ max_y { wTΨ(xi,y) + Δ(yi,y) } - wTΨ(xi,yi)

Convex in w, and sensitive to regularization of w

Learning Objective

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi ξi

s.t.  wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi   for all y

Learning Objective

Quadratic program with a large number of constraints

Many polynomial time algorithms
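When the output space is small enough to enumerate, the slack for each sample (the largest constraint violation) can be computed directly. A sketch under that assumption, with psi and delta supplied by the caller:

```python
def slack(w, psi, delta, x_i, y_i, outputs):
    """
    xi_i = max over y of  w^T Psi(x_i, y) + Delta(y_i, y) - w^T Psi(x_i, y_i),
    i.e. the smallest slack satisfying all constraints for sample i.
    psi(x, y) returns a NumPy feature vector; delta(y, y') returns the loss.
    """
    base = w @ psi(x_i, y_i)
    return max(w @ psi(x_i, y) + delta(y_i, y) - base for y in outputs)
```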

• Structured Output Prediction

• Structured Output SVM

• Optimization– Stochastic subgradient descent– Conditional gradient aka Frank-Wolfe

• Results

Outline

Shalev-Shwartz et al. Mathematical Programming 2011

Convex function g(z)

Gradient

Gradient s at a point z0:  g(z) - g(z0) ≥ sT(z - z0)

g(z) = z²   Gradient at z0? 2z0

min_z g(z)

Gradient Descent: start at some point z0

g(z) = z²

Move along the negative gradient direction

zt+1 ← zt - λt g’(zt)   Estimate the step-size via line search

Convex function g(z)

Gradient

Gradient s at a point z0:  g(z) - g(z0) ≥ sT(z - z0)

May not exist: for g(z) = |z| at z0 = 0, what is s?

Convex function g(z)

Subgradient

Subgradient s at a point z0:  g(z) - g(z0) ≥ sT(z - z0)

May not be unique: g(z) = |z| at z0 = 0

min_z g(z)

Subgradient Descent: start at some point z0

Move along the negative subgradient direction

zt+1 ← zt - λt g’(zt)   Estimate the step-size via line search

g(z) = |z|

Doesn’t always work

Subgradient Descent

min_z max{z2 + 2z1, z2 - 2z1}

(Figure: level sets g(z) = 3, 4, 5 in the (z1,z2) plane; a step of size λ along the negative of a particular subgradient increases the objective from 5 to 5 + 3λ)


min_z g(z)

Subgradient Descent: start at some point z0

Move along the negative subgradient direction

zt+1 ← zt - λt g’(zt)

Convergence if the step-sizes satisfy  Σt λt = ∞  and  limt→∞ λt = 0

g(z) = |z|
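A tiny illustration of these step-size conditions on the slide's example g(z) = |z|: λt = 1/(t+1) decays to zero but its sum diverges, so the iterates approach the minimizer even though no single subgradient step is guaranteed to decrease g.

```python
def subgradient_descent_abs(z0=5.0, iterations=200):
    """Subgradient descent on g(z) = |z| with step-size lambda_t = 1/(t+1)."""
    z = z0
    for t in range(iterations):
        s = 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)  # a subgradient of |z| at z
        z -= (1.0 / (t + 1)) * s                         # move along the negative subgradient
    return z

print(subgradient_descent_abs())  # close to 0
```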

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi ξi

s.t.  wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi   for all y

Learning Objective

Constrained problem?

Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Learning Objective

Subgradient?  g(z) - g(z0) ≥ sT(z - z0)

C Σi max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Subgradient

A subgradient of the i-th max term is  Ψ(xi,ŷ) - Ψ(xi,yi),  where

ŷ = argmax_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Proof?

Since wTΨ(xi,yi) does not depend on y, ŷ can equivalently be computed as

ŷ = argmax_y { wTΨ(xi,y) + Δ(yi,y) }   (Inference)

Classification inference

ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)}

Output: y {1,2,…,C}

Brute-force search

Inference

Detection inference

ŷ = argmaxy{wTΨ(xi,y) + Δ(yi,y)}

Output: y {1,2,…,C}

Brute-force search

Inference

Segmentation inference:  ŷ = argmax_y { wTΨ(xi,y) + Δ(yi,y) }

max_y  Σa (wa)TΨu(xia,ya) + Σa,b (wab)TΨp(xiab,yab) + Σa Δ(yia,ya)

Week 5 “Optimization” lectures
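When there are no pairwise terms and Δ is the per-pixel (Hamming) loss, the loss-augmented inference above also decomposes over pixels. A minimal sketch under those assumptions:

```python
import numpy as np

def loss_augmented_inference_unary(unary_scores, y_true):
    """
    argmax_y  sum_a (w_a)^T Psi_u(x_a, y_a) + sum_a Delta(y_true_a, y_a),
    with Delta the per-pixel 0-1 loss and no pairwise terms, so each pixel
    is solved independently.
    unary_scores: shape (m, C);  y_true: shape (m,), labels in {0, ..., C-1}.
    """
    m = unary_scores.shape[0]
    augmented = unary_scores + 1.0                 # every label pays loss 1 ...
    augmented[np.arange(m), y_true] -= 1.0         # ... except the ground-truth label
    return augmented.argmax(axis=1)
```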

Subgradient Descent

Start at some parameter w0

For t = 0 to T   // Number of iterations

    s = 2wt

    For i = 1 to n   // Number of samples

        ŷ = argmax_y { wtTΨ(xi,y) + Δ(yi,y) }

        s = s + C (Ψ(xi,ŷ) - Ψ(xi,yi))

    End

    wt+1 = wt - λt s,   λt = 1/(t+1)

End


Training data {(xi,yi), i = 1,2,…,n}

min_w ||w||2 + C Σi max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Learning Objective

Choose a sample ‘i’ with probability 1/n

min_w ||w||2 + Cn max_y { wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) }

Stochastic Approximation

Expected value? The original objective function

Stochastic Subgradient Descent

Start at some parameter w0

For t = 0 to T   // Number of iterations

    Choose a sample ‘i’ with probability 1/n

    ŷ = argmax_y { wtTΨ(xi,y) + Δ(yi,y) }

    s = 2wt + Cn (Ψ(xi,ŷ) - Ψ(xi,yi))

    wt+1 = wt - λt s,   λt = 1/(t+1)

End
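Putting the pieces together, a hedged sketch of the stochastic subgradient update (the caller supplies psi and the loss-augmented inference routine; the update subtracts the subgradient because the objective is minimised):

```python
import numpy as np

def ssvm_stochastic_subgradient(data, psi, loss_aug_argmax, dim, C=1.0, T=1000, seed=0):
    """
    data: list of (x_i, y_i) pairs; psi(x, y) -> feature vector of length dim;
    loss_aug_argmax(w, x_i, y_i) -> argmax_y { w^T Psi(x_i, y) + Delta(y_i, y) }.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    w = np.zeros(dim)
    for t in range(T):
        x_i, y_i = data[rng.integers(n)]              # pick sample i with probability 1/n
        y_hat = loss_aug_argmax(w, x_i, y_i)          # loss-augmented inference
        s = 2 * w + C * n * (psi(x_i, y_hat) - psi(x_i, y_i))   # stochastic subgradient
        w = w - (1.0 / (t + 1)) * s                   # step along the negative subgradient
    return w
```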

Convergence Rate

Compute an ε-optimal solution

C: SSVM hyperparameter

d: Number of non-zeros in the feature vector

O(dC/ε) iterations

Each iteration requires solving an inference problem

Side Note: Structured Output CNN

conv1 conv2 conv3 conv4 conv5 fc6 fc7 → SSVM

Back-propagate the subgradients

• Structured Output Prediction

• Structured Output SVM

• Optimization– Stochastic subgradient descent– Conditional gradient aka Frank-Wolfe

• Results

Outline

Lacoste-Julien et al. ICML 2013

Conditional Gradient

(Illustrative figures; slides courtesy Martin Jaggi)

SSVM Primal

min_w ||w||2 + C Σi ξi

s.t.  wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi   for all y

Derive dual on board

SSVM Dual

max_α  - ||Mα||2/4 + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

where w = Mα/2 and b is the vector with entries Δ(yi,y)

Linear Program

max_α  (Mα)Twt + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

Standard Frank-Wolfe: solve this over all possible α

Block Coordinate Frank-Wolfe: solve this over αi for a single sample ‘i’

Linear Program

max_α  (Mα)Twt + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

Vertices?  αi(y) = C if y = ŷ, 0 otherwise

Solution

max_α  (Mα)Twt + bTα

s.t.  Σy αi(y) = C   for all i

      αi(y) ≥ 0   for all i, y

ŷ = argmax_y { wtTΨ(xi,y) + Δ(yi,y) }   (Inference)

si(y) = C if y = ŷ, 0 otherwise

Which vertex maximizes the linear function?

Update

αt+1 = (1-μ) αt + μ s

Standard Frank-Wolfe: s contains the solution for all the samples

Block Coordinate Frank-Wolfe: s contains the solution for sample ‘i’, and sj = αtj for all other samples j

Step-Size

αt+1 = (1-μ) αt + μs

Maximizing a quadratic function in one variable μ

Analytical computation of optimal step-size
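With the dual written as max_α bTα - ||Mα||²/4, the line search along the Frank-Wolfe direction d = s - αt maximises a concave quadratic in μ and has a closed form. A sketch, treating M as an explicit matrix purely for illustration (in practice Mα is maintained implicitly through w = Mα/2):

```python
import numpy as np

def optimal_step_size(M, b, alpha, s):
    """
    Maximise g(alpha + mu * d) for g(alpha) = b^T alpha - ||M alpha||^2 / 4
    and d = s - alpha, with mu restricted to [0, 1].
    """
    d = s - alpha
    Md = M @ d
    slope = b @ d - 0.5 * (M @ alpha) @ Md   # derivative of g along d at mu = 0
    curvature = 0.5 * (Md @ Md)              # magnitude of the (negative) curvature along d
    if curvature <= 0.0:
        return 1.0 if slope > 0.0 else 0.0   # g is linear along d
    return float(np.clip(slope / curvature, 0.0, 1.0))
```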

Comparison

OCR Dataset

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results– Exact Inference– Approximate Inference– Choice of Loss Function

Outline

Optical Character Recognition

Identify each letter in a handwritten word

Taskar, Guestrin and Koller, NIPS 2003

(Chain model over letters X1 X2 X3 X4;  Labels L = {a, b, …, z})

(Bar chart comparing Logistic Regression and Multi-Class SVM)

(Bar chart comparing Maximum Likelihood and Structured Output SVM)

Image Segmentation

Szummer, Kohli and Hoiem, ECCV 2006

(Grid model over pixels X1,…,X9;  Labels L = {0, 1})

(Bar chart comparing Unary, Max Likelihood and SSVM)

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results– Exact Inference– Approximate Inference– Choice of Loss Function

Outline

Scene Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

Reuters Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

Yeast Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

Mediamill Dataset

Finley and Joachims, ICML 2008

(Bar chart comparing Greedy, LBP, Combine, Exact and LP inference)

• Structured Output Prediction

• Structured Output SVM

• Optimization

• Results– Exact Inference– Approximate Inference– Choice of Loss Function

Outline

“Jumping” Classification

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …, n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Threshold the score for classification

“Jumping” Ranking

(Rank 1 to Rank 6: example ranking with Average Precision = 1)

Ranking vs. Classification

(Rank 1 to Rank 6: example rankings showing that Average Precision and Accuracy can disagree; the figure lists AP values 1, 0.92 and 0.81 and Accuracy values 1 and 0.67)

Standard Pipeline

Collect dataset D = {(xi,yi), i = 1, …, n}

Learn your favourite classifier

Classifier assigns a score to each test sample

Sort the scores to obtain a ranking
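For reference, a small sketch of how Average Precision is computed from the sorted scores (AP is the mean of precision-at-k taken over the positions of the positive samples):

```python
import numpy as np

def average_precision(scores, labels):
    """labels are 0/1; AP = mean over positives of the precision at their rank."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # sort by decreasing score
    ranked = np.asarray(labels)[order]
    precision_at_k = np.cumsum(ranked) / (np.arange(len(ranked)) + 1)
    return precision_at_k[ranked == 1].mean()

# Example: positives ranked 1, 2, 3 and 5 out of 6 -> AP = (1 + 1 + 1 + 4/5) / 4 = 0.95
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 0, 1, 0]))
```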

Computes subgradients of the AP loss

Yue, Finley, Radlinski and Joachims, SIGIR 2007

(Bar chart, Training Time: training with the AP loss is about 5x slower than with the 0-1 loss)

(Bar chart, Average Precision: the AP loss gives roughly a 4% improvement over the 0-1 loss, for free)

Efficient Optimization of Average Precision

Pritish Mohapatra, C. V. Jawahar, M. Pawan Kumar

(Bar chart, Training Time: 0-1 loss, AP loss about 5x slower, and the proposed AP method slightly faster)

Each iteration of AP optimization is slightly slower

It takes fewer iterations to converge in practice

Questions?
