Fcv rep darrell

Learning visual representations for unfamiliar environments

Kate Saenko, Brian Kulis, Trevor Darrell

UC Berkeley EECS & ICSI

The challenge of large-scale visual interaction

The last decade has shown that models learned from data outperform hand-engineered structures!

Large-scale learning

• “Unsupervised”: learn models from “found data”; often exploit multiple modalities (text + image)

Example of found text: “… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...”

E.g., finding visual senses

Artifact sense: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)

[Saenko and Darrell ’09]

Large-scale learning

• “Unsupervised”: learn models from “found data”; often exploit multiple modalities (text + image)

• Supervised: crowdsource labels (e.g., ImageNet)

Yet…

• Even the best collection of images from the web and strong machine learning methods can often yield poor classifiers on in-situ data!

• Supervised learning assumption: training distribution == test distribution

• Unsupervised learning assumption: joint distribution is stationary between the online world and the real world

Almost never true!

“What You Saw Is Not What You Get”

The models fail due to domain shift

SVM: 54%, NBNN: 61%

SVM: 20%, NBNN: 19%

Examples of visual domain shifts

• Close-up vs. far-away

• amazon.com vs. consumer images

• Flickr vs. CCTV

• Digital SLR vs. webcam

Examples of domain shift: change in camera, feature type, dimension

• Camera: digital SLR vs. webcam

• Features: SURF, VQ to 300 vs. SIFT, VQ to 1000

• Result: different dimensions

Solutions?

• Do nothing (poor performance)

• Collect all types of data (impossible)

• Find out what changed (impractical)

• Learn what changed

Prior Work on Domain Adaptation

• Pre-process the data [Daumé ’07]: replicate features to also create source- and target-specific versions; re-train learner on new features

• SVM-based methods [Yang ’07], [Jiang ’08], [Duan ’09], [Duan ’10]: adapt SVM parameters

• Kernel mean matching [Gretton ’09]: re-weight training data to match test data distribution
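
The feature-replication idea in the first bullet can be sketched in a few lines. This is a minimal illustration, assuming the usual three-block layout (shared copy, source-only copy, target-only copy); the function name `augment` and the toy vectors are hypothetical and not taken from the slides.

```python
def augment(features, domain):
    """Return [shared copy, source-only copy, target-only copy] of a feature vector."""
    zeros = [0.0] * len(features)
    if domain == "source":
        return list(features) + list(features) + zeros
    if domain == "target":
        return list(features) + zeros + list(features)
    raise ValueError("domain must be 'source' or 'target'")


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy data (assumption): the same raw vector observed in each domain.
x_src = [1.0, 2.0]
x_tgt = [1.0, 2.0]
aug_src = augment(x_src, "source")
aug_tgt = augment(x_tgt, "target")

# Same-domain pairs share two blocks, cross-domain pairs share only one,
# so a learner trained on the augmented features can weight them differently.
same_domain = dot(aug_src, aug_src)
cross_domain = dot(aug_src, aug_tgt)
```

Note that this pre-processing requires both domains to use the same raw features, which is one of the limitations the transform-based approach is meant to lift.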

Our paradigm: Transform-based Domain Adaptation

Previous methods’ drawbacks

• cannot transfer learned shift to new categories

• cannot handle new features

We can do both by learning domain transformations*

Example: “green” and “blue” domains, related by a transformation W

* Saenko, Kulis, Fritz, and Darrell. Adapting visual category models to new domains. ECCV, 2010

Limitations of symmetric transforms

Saenko et al. ECCV10 used metric learning:

• symmetric transforms

• same features

How do we learn more general shifts?

[Figure: transformation W; symmetric assumption fails!]

Latest approach*: asymmetric transforms

• Metric learning model no longer applicable

• We propose to learn asymmetric transforms

– Map from target to source

– Handle different dimensions

[Figure: asymmetric transform (rotation) W]

*Kulis, Saenko, and Darrell, What You Saw is Not What You Get: Domain Adaptation Using Asymmetric Kernel Transforms, CVPR 2011

Model Details

• Learn a linear transformation to map points from one domain to another

– Call this transformation W

– Matrices of source and target:

Loss Functions

Choose a point x from the source and y from the target, and consider the inner product x^T W y.

It should be “large” for similar objects and “small” for dissimilar objects.
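
A small sketch of this cross-domain inner product may help. It assumes W has one row per source dimension and one column per target dimension, so the two domains need not share a dimensionality; the helper name `similarity` and the toy numbers are illustrative only.

```python
def similarity(x, W, y):
    """Compute x^T W y for a source vector x, a target vector y and a matrix W."""
    if len(W) != len(x) or any(len(row) != len(y) for row in W):
        raise ValueError("W must have shape (len(x), len(y))")
    # W y maps the target point into the source space ...
    Wy = [sum(w * yj for w, yj in zip(row, y)) for row in W]
    # ... where it is compared with x by an ordinary inner product.
    return sum(xi * v for xi, v in zip(x, Wy))


x = [1.0, 0.0]               # source point, 2-D (toy data)
y = [0.0, 1.0, 0.0]          # target point, 3-D (toy data)
W = [[0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0]]        # 2 x 3, so the dimensions may differ
score = similarity(x, W, y)
```

Because W is rectangular and not constrained to be symmetric or positive semidefinite, it is not a metric, which matches the earlier remark that the metric learning model no longer applies.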

Loss Functions

• Input to problem includes a collection of m loss functions

• General assumption: loss functions depend on data only through inner product matrix

Regularized Objective Function

• Minimize a linear combination of the sum of the loss functions and a regularizer

• We use the squared Frobenius norm as a regularizer

– Not restricted to this choice
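
The sketch below shows one way such an objective could look in code. Several parts are assumptions made for illustration: each loss is a squared hinge on a single pair (similar pairs should score at least `lower`, dissimilar pairs at most `upper`), the trade-off weight is called `lam`, and plain gradient descent stands in for whatever solver the authors actually used.

```python
def bilinear(x, W, y):
    """x^T W y for lists x, y and a nested-list matrix W."""
    return sum(x[i] * W[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def objective(W, pairs, lam, lower=1.0, upper=0.0):
    """0.5 * ||W||_F^2 + lam * sum of squared-hinge losses (assumed loss form)."""
    reg = 0.5 * sum(w * w for row in W for w in row)
    loss = 0.0
    for x, y, similar in pairs:
        s = bilinear(x, W, y)
        slack = max(0.0, lower - s) if similar else max(0.0, s - upper)
        loss += slack ** 2
    return reg + lam * loss


def gradient_step(W, pairs, lam, step, lower=1.0, upper=0.0):
    """One gradient-descent step on the objective above."""
    grad = [row[:] for row in W]  # gradient of 0.5 * ||W||_F^2 is W itself
    for x, y, similar in pairs:
        s = bilinear(x, W, y)
        coeff = -2.0 * max(0.0, lower - s) if similar else 2.0 * max(0.0, s - upper)
        for i in range(len(x)):
            for j in range(len(y)):
                grad[i][j] += lam * coeff * x[i] * y[j]
    return [[W[i][j] - step * grad[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


# Toy constraints (assumption): one similar and one dissimilar cross-domain pair.
pairs = [([1.0, 0.0], [1.0, 0.0, 0.0], True),
         ([0.0, 1.0], [1.0, 0.0, 0.0], False)]
W_init = [[0.0] * 3 for _ in range(2)]
W_learned = W_init
for _ in range(200):
    W_learned = gradient_step(W_learned, pairs, lam=10.0, step=0.01)

start = objective(W_init, pairs, lam=10.0)
end = objective(W_learned, pairs, lam=10.0)
```

The number of entries in W, and hence the cost of each step, is the product of the source and target dimensionalities.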

The Model Has Drawbacks

• A linear transformation may be insufficient

• Cost of optimization grows as the product of the dimensionalities of the source and target data

• What to do?

Kernelization

• Main idea: run in kernel space

– Use a non-linear kernel function (e.g., RBF kernel) to learn non-linear transformations in input space

– Resulting optimization is independent of input dimensionality

– Additional assumption necessary: regularizer is a spectral function
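
To see why working in kernel space removes the dependence on input dimensionality, consider a simplified parametrization (an assumption for this sketch, not necessarily the exact form used by the authors): W is built from the training points as a weighted sum of outer products, with one weight `L[a][b]` per source/target training pair. The similarity x^T W y then needs only kernel evaluations, and L has one entry per pair of training points regardless of feature dimension.

```python
import math


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_similarity(x, y, src_pts, tgt_pts, L, k_src, k_tgt):
    """sum_ab k_src(x, x_a) * L[a][b] * k_tgt(y_b, y)."""
    kx = [k_src(x, p) for p in src_pts]
    ky = [k_tgt(q, y) for q in tgt_pts]
    return sum(kx[a] * L[a][b] * ky[b]
               for a in range(len(src_pts)) for b in range(len(tgt_pts)))


def explicit_W(src_pts, tgt_pts, L):
    """Build W = sum_ab L[a][b] * x_a y_b^T explicitly (linear case only)."""
    d_src, d_tgt = len(src_pts[0]), len(tgt_pts[0])
    W = [[0.0] * d_tgt for _ in range(d_src)]
    for a, p in enumerate(src_pts):
        for b, q in enumerate(tgt_pts):
            for i in range(d_src):
                for j in range(d_tgt):
                    W[i][j] += L[a][b] * p[i] * q[j]
    return W


src_pts = [[1.0, 0.0], [0.0, 1.0]]            # toy source training points (2-D)
tgt_pts = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # toy target training points (3-D)
L = [[1.0, 0.5], [-1.0, 2.0]]                 # one weight per training pair
x, y = [2.0, 1.0], [1.0, 1.0, 0.0]

W = explicit_W(src_pts, tgt_pts, L)
direct = sum(x[i] * W[i][j] * y[j] for i in range(2) for j in range(3))
via_kernel = kernel_similarity(x, y, src_pts, tgt_pts, L, linear_kernel, linear_kernel)

# Swapping in a non-linear kernel gives a non-linear similarity with the same L.
nonlinear = kernel_similarity(x, y, src_pts, tgt_pts, L, rbf_kernel, rbf_kernel)
```

With the linear kernel, `direct` and `via_kernel` agree, which is the sense in which the kernel form is equivalent to learning W in input space.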

Kernelization

Original Transformation Learning Problem

Kernel matrices for source and target

New Kernel Problem

Relationship between original and new problems at optimality
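
The equations behind these four labels did not survive in the text. The block below is a hedged reconstruction based on the cited CVPR 2011 paper as best recalled, not on the slides themselves. It assumes X and Y hold the source and target points as columns, c_1, ..., c_m are the loss functions, r is the regularizer, and λ is a trade-off weight; all symbol names are assumptions.

```latex
% Original transformation learning problem
\min_{W} \; r(W) + \lambda \sum_{i=1}^{m} c_i\!\left(X^{\top} W Y\right)

% Kernel matrices for source and target
K_X = X^{\top} X, \qquad K_Y = Y^{\top} Y

% New kernel problem
\min_{L} \; r(L) + \lambda \sum_{i=1}^{m} c_i\!\left(K_X^{1/2} L \, K_Y^{1/2}\right)

% Relationship between original and new problems at optimality
W^{*} = X K_X^{-1/2} L^{*} K_Y^{-1/2} Y^{\top}
```

As a consistency check, substituting the last line into X^T W Y gives K_X^{1/2} L K_Y^{1/2}, the argument of the losses in the kernel problem, and the squared Frobenius norm of W equals that of L.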

Summary of approach

1. Multi-domain data

2. Generate constraints, learn W

3. Map via W

4. Apply to new categories

[Figure: input-space panels showing a test point and points y1, y2]
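
Steps 3 and 4 can be sketched as mapping a target test point into the source space and labeling it by its nearest source example. This is an illustration under assumptions: W is taken as already learned, Euclidean nearest-neighbor is used as the classifier, and the matrix, points and labels are made-up toy values.

```python
def map_to_source(W, y):
    """Map a target point y into the source space via W y."""
    return [sum(w * yj for w, yj in zip(row, y)) for row in W]


def nearest_source_label(W, y, source_points, source_labels):
    """Label a target point by its nearest source example after mapping."""
    mapped = map_to_source(W, y)

    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, mapped))

    best = min(range(len(source_points)), key=lambda i: sq_dist(source_points[i]))
    return source_labels[best]


W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0]]                  # hypothetical learned 2 x 3 transform
source_points = [[1.0, 0.0], [0.0, 1.0]]
source_labels = ["mug", "phone"]       # hypothetical category names
test_point = [0.1, 0.9, 0.3]           # 3-D target-domain test point

predicted = nearest_source_label(W, test_point, source_points, source_labels)
```

Since W depends on the domains and not on the category labels, the same mapping can be reused for categories that had no target-domain training examples.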

Multi-domain dataset

Experimental Setup

• Used a standard bag-of-words model

• Also used different features in the target domain

– SURF vs SIFT

– Different visual word dictionaries

• Baseline for comparing such data: KCCA
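
A minimal sketch of the bag-of-words step, assuming vector quantization to the nearest codeword and a normalized histogram. The codebooks and descriptors are toy values; in the setup above the two domains would also use different descriptor types, which this example does not model.

```python
def quantize(descriptor, codebook):
    """Index of the nearest codeword under squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])))


def bow_histogram(descriptors, codebook):
    """Normalized histogram of visual-word counts for one image."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[quantize(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Two hypothetical dictionaries of different sizes, one per domain.
codebook_a = [[0.0, 0.0], [1.0, 1.0]]
codebook_b = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.0, 0.2]]

hist_a = bow_histogram(descriptors, codebook_a)
hist_b = bow_histogram(descriptors, codebook_b)
```

The two histograms have different lengths, so they cannot be compared directly; this is the situation the asymmetric transform is designed to handle.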

Novel-class experiments

• Test method’s ability to transfer domain shift to unseen classes

• Train transform on half of the classes, test on the other half
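
The split described above might be set up as follows. This is only a sketch: the class names and samples are invented, and taking the first half of the sorted class list is an arbitrary choice made for illustration.

```python
def split_by_class(samples, train_classes):
    """Split (features, label) samples into seen-class and unseen-class sets."""
    train = [s for s in samples if s[1] in train_classes]
    test = [s for s in samples if s[1] not in train_classes]
    return train, test


samples = [([0.1], "bike"), ([0.2], "mug"), ([0.3], "laptop"), ([0.4], "phone")]
classes = sorted({label for _, label in samples})
train_classes = set(classes[:len(classes) // 2])

# The transform is learned from train_set only; test_set holds unseen classes.
train_set, test_set = split_by_class(samples, train_classes)
```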

[Figure legend: Our Method (linear); Our Method]

Extreme shift example

[Figure: query from target; nearest neighbors in source using transformation; nearest neighbors in source using KCCA+KNN]

Conclusion

• Should not rely on hand-engineered features any more than we rely on hand-engineered models!

• Learn feature transformation across domains

• Developed a domain adaptation method based on regularized non-linear transforms

– Asymmetric transform achieves best results on more extreme shifts

– Saenko et al., ECCV 2010 and Kulis et al., CVPR 2011; journal version forthcoming
