
Page 1: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

CS558 COMPUTER VISION
Lecture XI: Object Recognition (2)

Slides adapted from S. Lazebnik

Page 2: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

OUTLINE

• Object recognition: overview and history
• Object recognition: a learning-based approach
• Object recognition: bag-of-features representation
• Object recognition: classification algorithms

Page 3: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

BAG-OF-FEATURES MODELS

Page 4: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

ORIGIN 1: TEXTURE RECOGNITION

Texture is characterized by the repetition of basic elements or textons

For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters

Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

Page 5: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

ORIGIN 1: TEXTURE RECOGNITION

Universal texton dictionary

histogram

Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

Page 6: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

ORIGIN 2: BAG-OF-WORDS MODELS

Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)

Page 7: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

ORIGIN 2: BAG-OF-WORDS MODELS

US Presidential Speeches Tag Cloud (http://chir.ag/phernalia/preztags/)

Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)


Page 10: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

1. Extract features

2. Learn “visual vocabulary”

3. Quantize features using visual vocabulary

4. Represent images by frequencies (counts) of “visual words”

BAG-OF-FEATURES STEPS

Page 11: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

1. FEATURE EXTRACTION

Regular grid or interest regions
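For the regular-grid option, here is a minimal sketch (assuming NumPy and a grayscale image stored as a 2D array; the function name and parameter values are illustrative, not part of the original slides):

```python
import numpy as np

def sample_patches_on_grid(image, patch_size=16, stride=8):
    """Extract fixed-size square patches on a regular grid over a grayscale image."""
    H, W = image.shape
    patches = []
    for y in range(0, H - patch_size + 1, stride):
        for x in range(0, W - patch_size + 1, stride):
            patch = image[y:y + patch_size, x:x + patch_size]
            patches.append(patch.ravel().astype(np.float32))  # flatten to one row
    return np.array(patches)

# Example: a 128x128 image with 16-pixel patches and stride 8 yields 15x15 = 225 patches.
patches = sample_patches_on_grid(np.random.rand(128, 128))
print(patches.shape)  # (225, 256)
```

Interest-region sampling would replace the grid loop with the (x, y, scale) outputs of a detector such as Harris or DoG.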

Page 12: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

Detect patches → normalize each patch → compute descriptor

Slide credit: Josef Sivic

1. FEATURE EXTRACTION

Page 13: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

1. FEATURE EXTRACTION

Slide credit: Josef Sivic

Page 14: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

2. LEARNING THE VISUAL VOCABULARY

Slide credit: Josef Sivic

Page 15: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

2. LEARNING THE VISUAL VOCABULARY

Clustering

Slide credit: Josef Sivic

Page 16: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

2. LEARNING THE VISUAL VOCABULARY

Clustering

Slide credit: Josef Sivic

Visual vocabulary

Page 17: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

K-MEANS CLUSTERING

• Want to minimize the sum of squared Euclidean distances between points xi and their nearest cluster centers mk:

D(X, M) = Σ_k Σ_{i in cluster k} ||xi - mk||²

Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
  - Assign each data point to the nearest center
  - Recompute each cluster center as the mean of all points assigned to it
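A minimal NumPy sketch of this algorithm (Lloyd's iterations; the random initialization and the convergence test are simplifications of what a production implementation would do):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimize D(X, M) = sum over clusters k, points i in cluster k, of ||xi - mk||^2."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]       # random initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of the points assigned to it.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                     # converged
            break
        centers = new_centers
    return centers, labels
```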

Page 18: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

CLUSTERING AND VECTOR QUANTIZATION

• Clustering is a common method for learning a visual vocabulary or codebook
  - Unsupervised learning process
  - Each cluster center produced by k-means becomes a codevector
  - The codebook can be learned on a separate training set
  - Provided the training set is sufficiently representative, the codebook will be “universal”

• The codebook is used for quantizing features
  - A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in the codebook
  - Codebook = visual vocabulary
  - Codevector = visual word
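A small sketch of the quantization and counting steps (assuming the codebook comes from k-means as above; the helper names are illustrative):

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each descriptor to the index of its nearest codevector (visual word)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bag_of_features(descriptors, codebook):
    """Bag-of-features representation: normalized histogram of visual-word counts."""
    words = quantize(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)   # frequencies rather than raw counts
```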

Page 19: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

EXAMPLE CODEBOOK

Source: B. Leibe

Appearance codebook

Page 20: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

ANOTHER CODEBOOK

Appearance codebook

Source: B. Leibe

Page 21: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

YET ANOTHER CODEBOOK

Fei-Fei et al. 2005

Page 22: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

VISUAL VOCABULARIES: ISSUES

• How to choose vocabulary size?
  - Too small: visual words are not representative of all patches
  - Too large: quantization artifacts, overfitting

• Computational efficiency
  - Vocabulary trees (Nister & Stewenius, 2006)

Page 23: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SPATIAL PYRAMID REPRESENTATION

Extension of a bag of features: a locally orderless representation at several levels of resolution

level 0

Lazebnik, Schmid & Ponce (CVPR 2006)

Page 24: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

Extension of a bag of features: a locally orderless representation at several levels of resolution

SPATIAL PYRAMID REPRESENTATION

level 0, level 1

Lazebnik, Schmid & Ponce (CVPR 2006)

Page 25: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

Extension of a bag of features: a locally orderless representation at several levels of resolution

SPATIAL PYRAMID REPRESENTATION

level 0, level 1, level 2

Lazebnik, Schmid & Ponce (CVPR 2006)
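A sketch of how such a pyramid can be assembled from the quantized features (the per-level weighting used in the paper is omitted for brevity; the function signature is only illustrative):

```python
import numpy as np

def spatial_pyramid(xy, words, image_size, vocab_size, levels=2):
    """Concatenate per-cell visual-word histograms over 1x1, 2x2, 4x4, ... grids.

    xy: (N, 2) array of (x, y) feature locations; words: (N,) visual-word indices;
    image_size: (width, height) of the image.
    """
    W, H = image_size
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        cx = np.minimum((xy[:, 0] / W * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] / H * cells).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                parts.append(np.bincount(words[in_cell], minlength=vocab_size))
    return np.concatenate(parts).astype(np.float64)
```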

Page 26: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SCENE CATEGORY DATASET

Multi-class classification results (100 training images per class)

Page 27: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

CALTECH101 DATASET

http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html

Multi-class classification results (30 training images per class)

Page 28: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

BAGS OF FEATURES FOR ACTION RECOGNITION

Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words, IJCV 2008.

Space-time interest points

Page 29: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

BAGS OF FEATURES FOR ACTION RECOGNITION

Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words, IJCV 2008.

Page 30: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

IMAGE CLASSIFICATION

• Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?

Page 31: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

OUTLINE

• Object recognition: overview and history
• Object recognition: a learning-based approach
• Object recognition: bag-of-features representation
• Object recognition: classification algorithms

Page 32: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

CLASSIFIERS

Learn a decision rule assigning bag-of-features representations of images to different classes

Zebra

Non-zebra

Decision boundary

Page 33: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

CLASSIFICATION

• Assign any input vector to one of two or more classes
• Any decision rule divides the input space into decision regions separated by decision boundaries

Page 34: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

NEAREST NEIGHBOR CLASSIFIER

Assign the label of the nearest training data point to each test data point

Voronoi partitioning of feature space for two-category 2D and 3D data

from Duda et al.

Source: D. Lowe

Page 35: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

For a new point, find the k closest points from the training data set.

The labels of the k points “vote” to classify the new point.

Works well provided that there is lots of data and the distance/similarity function is good.

k-Nearest Neighbors

k = 5

Source: D. Lowe
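A minimal k-NN sketch over bag-of-features histograms (Euclidean distance here; any of the histogram distances on the next slide could be substituted; the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(test_hist, train_hists, train_labels, k=5):
    """Label a test histogram by majority vote among its k nearest training histograms."""
    dists = np.linalg.norm(train_hists - test_hist, axis=1)   # distance to every training image
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```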

Page 36: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

FUNCTIONS FOR COMPARING HISTOGRAMS

• L1 distance:
  D(h1, h2) = Σ_{i=1..N} |h1(i) - h2(i)|

• χ² distance:
  D(h1, h2) = Σ_{i=1..N} (h1(i) - h2(i))² / (h1(i) + h2(i))

• Quadratic distance (cross-bin distance):
  D(h1, h2) = Σ_{i,j} A_ij (h1(i) - h2(j))²

• Histogram intersection (similarity function):
  I(h1, h2) = Σ_{i=1..N} min(h1(i), h2(i))
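Direct NumPy translations of these four functions (the epsilon in the χ² distance is a small guard against empty bins, not part of the slide formula):

```python
import numpy as np

def l1_distance(h1, h2):
    return np.abs(h1 - h2).sum()

def chi2_distance(h1, h2, eps=1e-10):
    return ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()

def quadratic_distance(h1, h2, A):
    # A[i, j] weights the cross-bin term between bin i of h1 and bin j of h2.
    diff = h1[:, None] - h2[None, :]
    return (A * diff ** 2).sum()

def histogram_intersection(h1, h2):
    # A similarity, not a distance: larger values mean more overlap.
    return np.minimum(h1, h2).sum()
```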

Page 37: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

LINEAR CLASSIFIERS

• Find a linear function (hyperplane) to separate positive and negative examples:

  positive: xi · w + b ≥ 0
  negative: xi · w + b < 0

Which hyperplane is best?

Page 38: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SUPPORT VECTOR MACHINES

• Find hyperplane that maximizes the margin between the positive and negative examples

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 39: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SUPPORT VECTOR MACHINES

• Find a hyperplane that maximizes the margin between the positive and negative examples

  positive (yi = 1):  xi · w + b ≥ 1
  negative (yi = -1): xi · w + b ≤ -1

Margin; support vectors (the examples lying on the margin)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Distance between a point and the hyperplane: |xi · w + b| / ||w||

For support vectors, xi · w + b = ±1

Therefore, the margin is 2 / ||w||

Page 40: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

FINDING THE MAXIMUM MARGIN HYPERPLANE

1. Maximize the margin 2 / ||w||
2. Correctly classify all training examples:
   positive (yi = 1):  xi · w + b ≥ 1
   negative (yi = -1): xi · w + b ≤ -1

Quadratic programming problem:
Minimize (1/2) wᵀw
Subject to yi (w · xi + b) ≥ 1

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 41: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

FINDING THE MAXIMUM MARGIN HYPERPLANE

• Solution: w = Σ_i αi yi xi
  (learned weight αi, support vector xi)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 42: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

FINDING THE MAXIMUM MARGIN HYPERPLANE

• Solution:
  w = Σ_i αi yi xi
  b = yi - w · xi   for any support vector

• Classification function (decision boundary):
  w · x + b = Σ_i αi yi (xi · x) + b

• Notice that it relies on an inner product between the test point x and the support vectors xi

• Solving the optimization problem also involves computing the inner products xi · xj between all pairs of training points

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
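The relations above translate into a few lines of NumPy; the weights αi are assumed to come from a quadratic-programming solver, which is not shown here:

```python
import numpy as np

def linear_svm_predict(x, support_vectors, alphas, labels, b):
    """f(x) = sign(w . x + b) with w = sum_i alpha_i * y_i * x_i over the support vectors."""
    w = (alphas * labels) @ support_vectors   # learned weight vector
    return np.sign(w @ x + b)
```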

Page 43: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

• Datasets that are linearly separable work out great:

• But what if the dataset is just too hard?

• We can map it to a higher-dimensional space:

(Illustration: 1D points along the x axis; mapping each point x to (x, x²) makes the hard dataset linearly separable.)

NONLINEAR SVMS

Slide credit: Andrew Moore

Page 44: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

NONLINEAR SVMS

Slide credit: Andrew Moore

Page 45: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that

K(xi , xj) = φ(xi ) · φ(xj)

(To be valid, the kernel function must satisfy Mercer’s condition)

• This gives a nonlinear decision boundary in the original feature space:

NONLINEAR SVMS

Σ_i αi yi φ(xi) · φ(x) + b = Σ_i αi yi K(xi, x) + b

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Page 46: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

NONLINEAR KERNEL: EXAMPLE

• Consider the mapping φ(x) = (x, x²)

φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y²

K(x, y) = xy + x²y²   (a polynomial kernel)

Page 47: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

KERNELS FOR BAGS OF FEATURES

• Histogram intersection kernel:
  I(h1, h2) = Σ_{i=1..N} min(h1(i), h2(i))

• Generalized Gaussian kernel:
  K(h1, h2) = exp(-(1/A) D(h1, h2)²)

• D can be the L1 distance, Euclidean distance, χ² distance, etc.

• A kernel function must be a similarity function.

J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
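Sketches of both kernels as functions returning full kernel matrices (the scale A is often set to the mean distance over the training pairs; that heuristic and the helper names are assumptions, not from the slide):

```python
import numpy as np

def intersection_kernel(H1, H2):
    """Histogram intersection kernel matrix between two sets of histograms (rows)."""
    return np.array([[np.minimum(h1, h2).sum() for h2 in H2] for h1 in H1])

def generalized_gaussian_kernel(H1, H2, A, eps=1e-10):
    """K(h1, h2) = exp(-(1/A) * D(h1, h2)^2) with D = chi-square distance."""
    D = np.array([[((h1 - h2) ** 2 / (h1 + h2 + eps)).sum() for h2 in H2] for h1 in H1])
    return np.exp(-(D ** 2) / A)
```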

Page 48: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SUMMARY: SVMS FOR IMAGE CLASSIFICATION

1. Pick an image representation (in our case, bag of features)
2. Choose a kernel function for that representation
3. Compute the matrix of kernel values between all pairs of training examples
4. Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights
5. At test time: compute kernel values between your test example and all support vectors, and combine them with the learned weights to yield the value of the final decision function
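Steps 3-5 can be carried out with any SVM package that accepts a precomputed kernel; the sketch below uses scikit-learn's SVC, which is an assumption of convenience rather than the package used in the course:

```python
from sklearn.svm import SVC

def train_and_predict(train_hists, train_labels, test_hists, kernel_fn):
    """kernel_fn(H1, H2) returns the matrix of kernel values K(h1, h2)."""
    K_train = kernel_fn(train_hists, train_hists)    # kernel between all training pairs
    clf = SVC(kernel="precomputed").fit(K_train, train_labels)
    K_test = kernel_fn(test_hists, train_hists)      # test examples vs. training examples
    return clf.predict(K_test)
```

With more than two classes, SVC handles the multi-class case internally using the one-vs-one scheme discussed on the next slide.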

Page 49: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

WHAT ABOUT MULTI-CLASS SVMS?

• Unfortunately, there is no “definitive” multi-class SVM formulation

• In practice, we have to obtain a multi-class SVM by combining multiple binary (two-class) SVMs

• One vs. others
  - Training: learn an SVM for each class vs. the others
  - Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value

• One vs. one
  - Training: learn an SVM for each pair of classes
  - Testing: each learned SVM “votes” for a class to assign to the test example

Page 50: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SVMS: PROS AND CONS

• Pros
  - Many publicly available SVM packages: http://www.kernel-machines.org/software
  - The kernel-based framework is very powerful and flexible
  - SVMs work very well in practice, even with very small training sample sizes

• Cons
  - No “direct” multi-class SVM; must combine two-class SVMs
  - Computation and memory: during training, a matrix of kernel values between all pairs of examples must be computed (training time ≥ O(n²))
  - Training SVMs can take a very long time for large-scale problems (in practice, kernel SVMs can typically handle ~100K training examples on a single machine)

Page 51: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

SUMMARY: CLASSIFIERS

• Nearest-neighbor and k-nearest-neighbor classifiers
  - L1 distance, χ² distance, quadratic distance, histogram intersection

• Support vector machines
  - Linear classifiers
  - Margin maximization
  - The kernel trick
  - Kernel functions: histogram intersection, generalized Gaussian, pyramid match
  - Multi-class

• Of course, there are many other classifiers out there
  - Neural networks, boosting, decision trees, ...
  - Deep neural networks, convolutional neural networks: jointly learning feature representations and classifiers

Page 52: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

PROJECT #4: Visual Recognition Challenge

Page 53: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

DATA SET:

Graz02: Bike, Cars, Person, None (background)

Page 54: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

TRAIN/VALIDATION/TEST

Use the Train dataset to train the classifiers for recognizing the three object categories;

Use the validation dataset to tune any parameters;

Fix the parameters you tuned and report your final classification accuracy on the test images.

Page 55: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

FEATURE EXTRACTION

Sample fixed-size image patches on a regular grid over the image. You can use a single fixed patch size, or several different sizes.

Sample random-size patches at random locations

Sample fixed-size patches around the feature locations found by, e.g., your own Harris corner detector or any other feature detector you find online. Note: if you use code downloaded online, please cite it clearly in your report.

Page 56: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

FEATURE DESCRIPTION

It is recommended that you implement the SIFT descriptor described in the lecture slides for each image patch you generate, but you may also use other feature descriptors, such as raw pixel values (bias-gain normalized). Your SIFT descriptor does not need to be rotation invariant.

Page 57: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

DICTIONARY COMPUTATION

Run K-means clustering on all training feature descriptors to learn the dictionary. Setting K to 500-1000 should be sufficient; you may experiment with different sizes.

Page 58: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

IMAGE REPRESENTATION

Given the features extracted from any training/validation/test image, quantize each feature to its closest codevector in the dictionary. For each image, you can accumulate a histogram by counting how many features are quantized to each code in the dictionary. This is your image representation (bag-of-features).

Page 59: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

CLASSIFIER TRAINING

Given the histogram representations of all images, you may use a k-NN classifier or any other classifier you wish. If you want to push recognition accuracy further, consider training an SVM classifier. Feel free to use any SVM library, e.g., the Matlab toolbox provided at this link (http://theoval.cmp.uea.ac.uk/~gcc/svm/toolbox/).

Page 60: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

RECOGNITION QUALITY

After you train the classifier, report recognition accuracy on the test image set, i.e., the fraction of images recognized correctly. Also generate a confusion matrix to enumerate how many images of each category are misclassified as each other category. Report results on both the validation and test datasets to see how well your algorithm generalizes to unseen data.
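A small sketch of the two reported quantities, with hypothetical label names for the Graz02 categories:

```python
import numpy as np

def evaluate(true_labels, predicted_labels, class_names):
    """Return overall accuracy and a confusion matrix (rows: true class, columns: predicted)."""
    index = {c: i for i, c in enumerate(class_names)}
    cm = np.zeros((len(class_names), len(class_names)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[index[t], index[p]] += 1
    return np.trace(cm) / cm.sum(), cm

acc, cm = evaluate(["bike", "car", "person", "none"],
                   ["bike", "person", "person", "none"],
                   ["bike", "car", "person", "none"])
print(acc)   # 0.75
print(cm)    # off-diagonal entries show which categories get confused
```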

Page 61: CS558 C OMPUTER V ISION Lecture XI: Object Recognition (2) Slides adapted from S. Lazebnik

IMPORTANT NOTE

Please follow the train/validation/test protocol and do not heavily tune parameters on the test dataset. You will be asked to report recognition accuracy on both the validation and test datasets. Tuning parameters on the test dataset is considered cheating and is not allowed.