
ECS 289G: Visual Recognition
Winter 2018

Yong Jae Lee
Department of Computer Science

Plan for today

• Questions?
• Sign-up for paper
• Research overview

Success in modern visual recognition research

Image classification

Object detection
Pose recognition

Semantic segmentation

… and many more

Ingredients for success today

1. Fast computation (GPUs)

2. Large labeled image dataset

3. Network architecture

Which ingredient will be the bottleneck for tomorrow's success?

Ingredients for success today

1. Fast computation (GPUs)
2. Large labeled image dataset
3. Network architecture

Ingredient 2 requires expensive, detailed human supervision.

Take semantic segmentation as an example…

Requires pixel-level semantic labels

70,000+ annotation hours for 328K images but only 80 object categories (MS COCO)

Ingredients for success today

1. Fast computation (GPUs)
2. Large weakly-labeled image dataset
3. Network architecture

Fully-supervised vs. Weakly-supervised

{Person, umbrella, hat, … }

Fully-supervised: pixel-level labels

Weakly-supervised: image-level labels

Overview: Weakly-supervised visual recognition

1. Learning object detectors with tags

2. Learning visual attributes with pairwise comparisons

3. Learning visual grounding with captions

(Examples: "plane" image tag, "smile" attribute, "clock tower is tall" caption)


Weakly-supervised object detection

• Supervision is provided at the image level → scalable!
• Due to intra-class appearance variations, occlusion, and clutter, mined regions may correspond to object parts or include background

(Figure: annotators provide image-level "car" / "no car" labels; discriminative patches are mined from the weakly-labeled training images)

[Weber et al. 2000, Pandey & Lazebnik 2011, Deselaers et al. 2012, Song et al. 2014, …]
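To make the patch-mining step concrete, here is a minimal sketch of the multiple-instance-style alternation that several of the cited methods build on; it is an illustration, not any single paper's algorithm. Region features are assumed to be precomputed elsewhere, and `pos_bags` / `neg_regions` are hypothetical inputs.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_discriminative_regions(pos_bags, neg_regions, n_rounds=5):
    """Alternate between picking the top-scoring region in each positive
    image (bag) and re-training a region classifier.

    pos_bags:    list of (num_regions_i, feat_dim) arrays, one per "car" image
    neg_regions: (num_neg, feat_dim) array of regions from "no car" images
    """
    # Initialize pseudo-positives with one arbitrary region per positive image,
    # e.g. the whole-image region (assumed to be index 0 here).
    pseudo_pos = np.stack([bag[0] for bag in pos_bags])

    for _ in range(n_rounds):
        X = np.concatenate([pseudo_pos, neg_regions])
        y = np.concatenate([np.ones(len(pseudo_pos)), np.zeros(len(neg_regions))])
        clf = LinearSVC(C=1.0).fit(X, y)

        # Re-localize: in each positive bag, keep the highest-scoring region.
        pseudo_pos = np.stack(
            [bag[np.argmax(clf.decision_function(bag))] for bag in pos_bags]
        )
    return clf, pseudo_pos
```

The second bullet above describes exactly the failure mode of this loop: the selected regions can latch onto object parts or co-occurring background.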

Our idea: Track and transfer

Weakly-labeled videos tagged with "car"
Weakly-labeled training images tagged with "car"

[Singh, Xiao, Lee, "Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection", CVPR 2016]

Transferring tracked object boundaries

[Unsupervised video object proposals, Xiao and Lee, CVPR 2016]

• For each video frame, search over scales and locations
• Match the mined positive image region to the best-scoring frame region, then retrieve that region's tracked object boundary
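A minimal sketch of the matching step described above, assuming appearance features have already been extracted (e.g., by a CNN) for the mined image region and for candidate frame regions; the box sampling, feature choice, and similarity measure are illustrative placeholders rather than the exact procedure from the CVPR 2016 papers.

```python
import numpy as np

def sample_boxes(img_w, img_h, scales=(64, 128, 256), stride=32):
    """Densely sample square boxes over scales and locations in a frame."""
    boxes = []
    for s in scales:
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                boxes.append((x, y, x + s, y + s))
    return np.array(boxes)

def best_match_in_frame(region_feat, frame_feats, boxes):
    """Return the frame box whose appearance best matches the mined region.

    region_feat: (d,) feature of the mined positive image region
    frame_feats: (n, d) features of the candidate boxes (computed externally)
    boxes:       (n, 4) candidate boxes as (x1, y1, x2, y2)
    """
    # Cosine similarity between the mined region and every candidate box.
    a = region_feat / (np.linalg.norm(region_feat) + 1e-8)
    b = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    sims = b @ a
    best = int(np.argmax(sims))
    return boxes[best], float(sims[best])
```

The best-matching box's tracked object boundary is then retrieved and transferred back onto the mined image region as pseudo ground truth.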

Discovered pseudo ground-truth boxes

• Handles atypical pose, partial occlusion, and clutter

Weakly-supervised object detection accuracy

• Significantly outperformed existing weakly-supervised detection methods (at the time)

Overview: Weakly-supervised visual recognition

1. Learning object detectors with tags

2. Learning visual attributes with pairwise comparisons

3. Learning visual grounding with captions


Attribute: Smile

Goal of weakly-supervised attribute localization

• Localizing the relevant region improves attribute modeling
• In the relative setting, we are provided with pairwise comparisons

Previous approaches

• Require pre-trained part detectors or crowdsourcing [Kumar et al. ICCV 2009, Bourdev et al. ICCV 2011, Duan et al. CVPR 2012, Zhang et al. CVPR 2014]

• “Pipeline” where features, localizer, and classifier are trained separately and sequentially; suboptimal and slow [Xiao and Lee, ICCV 2015]

• Our idea: jointly learn features, localizer, and classifier end-to-end in a deep learning framework

• We focus on the relative attribute setting

Overview of our end-to-end approach

[Singh and Lee, “End-to-End Localization and Ranking for Relative Attributes”, ECCV 2016]

(Architecture diagram: two weight-sharing Siamese branches S1 and S2; each passes its input image I1 or I2 through a Localization Network and a Ranker Network to produce a score V1 or V2, and the two scores feed the pairwise loss function. Example attribute: Smile)

• Goal: Given pairs of ordered training images, simultaneously localize attribute in each image and learn a ranker
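Since the slides do not spell out the loss, here is a minimal RankNet-style pairwise loss one could place on top of the two Siamese branch outputs; the exact loss used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(v1, v2, target):
    """RankNet-style loss on scores from the two Siamese branches.

    v1, v2: (batch,) ranking scores for images I1 and I2
    target: (batch,) 1 if I1 shows the attribute more strongly than I2, else 0
    """
    # P(I1 > I2) is modeled as the sigmoid of the score difference.
    return F.binary_cross_entropy_with_logits(v1 - v2, target.float())
```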

Our end-to-end approach

(Architecture diagram: input image I → Localization Network (conv layers that regress transform parameters θ for a grid generator, which crops the attribute region) → Ranker Network (conv + fully-connected layers) → ranking score V)

• Ranker network takes the localized region to produce a ranking score
• Combined with the global image for global context
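The θ / grid-generator block in the diagram suggests a spatial-transformer-style localizer. Below is a minimal PyTorch sketch of one Siamese branch under that assumption; layer widths are simplified and the global-image context path is omitted for brevity, so this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizeAndRank(nn.Module):
    """One Siamese branch: localize the attribute region, then score it."""

    def __init__(self):
        super().__init__()
        # Localization network: small conv net that regresses an affine theta.
        self.loc_features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.theta_head = nn.Linear(64 * 4 * 4, 6)
        # Start from the identity transform so training begins with the full image.
        self.theta_head.weight.data.zero_()
        self.theta_head.bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Ranker network: scores the cropped region.
        self.ranker = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, img):
        feats = self.loc_features(img).flatten(1)
        theta = self.theta_head(feats).view(-1, 2, 3)
        # Grid generator + sampler crop a differentiable attribute region.
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        region = F.grid_sample(img, grid, align_corners=False)
        return self.ranker(region).squeeze(1)   # (batch,) ranking scores
```

The two branches share weights, so v1 = branch(I1) and v2 = branch(I2) can be fed directly to the pairwise loss sketched earlier.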

Progression of localized region over training epochs

(Figure: localization heatmaps for the attributes "Smile" and "Dark hair" across training epochs)

• Heatmap: distribution of the localized region across the entire training dataset

Testing

(Diagram: test image → Localization Network → Ranker Network (one Siamese branch, S1) → score V_test)

• Localize the relevant attribute region
• Produce a ranking score for the test image

Results: Discovered regions and ranking on LFW-10 Faces

(Figure: test images ordered from weak to strong for the attributes Bald, Dark hair, Eyes open, and Smile)

• Our network discovers relevant attribute regions
• Leads to accurate rankings

Results: Discovered regions and ranking on UT-Zap50K Shoes

(Figure: test images ordered from weak to strong for the attributes Open, Pointy, Sporty, and Comfort)

Results: Image pair ranking accuracy

• % of test image pairs whose predicted relative attribute ranking is correct
• State-of-the-art results on LFW-10, UT-Zap50K, OSR, Shoe-with-Attribute
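The metric in the first bullet can be computed directly from predicted scores; the array names below are placeholders.

```python
import numpy as np

def pairwise_ranking_accuracy(scores, pairs, labels):
    """Fraction of test pairs whose predicted ordering matches the ground truth.

    scores: (num_images,) predicted attribute scores
    pairs:  (num_pairs, 2) indices (i, j) of compared images
    labels: (num_pairs,) 1 if image i shows more of the attribute than image j, else 0
    """
    pred = (scores[pairs[:, 0]] > scores[pairs[:, 1]]).astype(int)
    return float((pred == labels).mean())
```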

Overview: Weakly-supervised visual recognition

1. Learning object detectors with tags

2. Learning visual attributes with pairwise comparisons

3. Learning visual grounding with captions


Tag-based weakly supervised localization

• Input: image + tag (e.g., "car", "plane", "cow") → localizing objects
• Bounding box detections [Deselaers et al. 2010, Pandey & Lazebnik 2011, Song et al. 2014, Singh et al. 2016, ...]
• Semantic segmentation [Xu et al. 2014, Pathak et al. 2015, Pinheiro & Collobert 2015, ...]

Tags vs. natural language

Tags: "cat", "hand", "donut"
Caption: "A grey cat staring at a hand with a donut."

• Tags convey a very limited amount of information about images
• It is much more informative and natural to describe images with a caption

Our goal: Leverage structure in natural language for weakly supervised visual grounding

[Xiao, Sigal, Lee, "Weakly-supervised Visual Grounding of Phrases with Linguistic Structures", CVPR 2017]

"A grey cat staring at a hand with a donut."

• Visual grounding of free-form language with only image-level supervision
• Exploit linguistic structure in the caption
• Parse the caption into a syntactic parse tree [Socher et al. 2013]

Key idea: Transfer linguistic structure to the visual domain

"A grey cat staring at a hand with a donut."

(1) Sibling exclusivity: image regions for sibling nodes in the parse tree should be exclusive
(2) Parent-child inclusivity: the image region of a parent node is the union of its children's regions
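As an illustration of how the two constraints could be turned into training signals, here is a minimal sketch that assumes each parse-tree node is grounded as a soft spatial attention mask over the image; the losses in the CVPR 2017 paper may be formulated differently.

```python
import torch

def parent_child_loss(parent_mask, child_masks):
    """Parent-child inclusivity: the parent's region should match the union
    (elementwise max) of its children's regions.

    parent_mask: (H, W) attention map in [0, 1] for the parent phrase
    child_masks: list of (H, W) attention maps for the child phrases
    """
    union = torch.stack(child_masks).max(dim=0).values
    return ((parent_mask - union) ** 2).mean()

def sibling_exclusivity_loss(child_masks):
    """Sibling exclusivity: regions of sibling phrases should not overlap."""
    loss = 0.0
    for i in range(len(child_masks)):
        for j in range(i + 1, len(child_masks)):
            loss = loss + (child_masks[i] * child_masks[j]).mean()
    return loss
```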

Network architecture

• Four submodules: (1) visual encoder, (2) language encoder, (3) semantic embedding module, and (4) loss functions

Phrase grounding on Visual Genome

Left: our model. Right: baseline trained without structural constraints (discriminative loss only)

Query phrases: "the train is on the bridge", "a person driving a boat", "a jockey riding a horse"

Multiple queries on Visual Genome

Query phrases: "snow covered park", "bench is black and white", "clock tower is tall", "buildings by street", "the walls are dark purple", "yellow and orange cat"

• Our groundings generated for different phrases of the same image

Multiple queries on MS COCO

Tags: pizza, person, bottle, keyboard, bus, sheep

• Our groundings generated for different tags of the same image

Phrase grounding on Visual Genome

Animal keypoint detection

(Figure: example input images and output keypoint detections)

Motivation

• Veterinary research has found that facial expressions are a good indicator of pain in horses, sheep, and cows

K. B. Gleerup, B. Forkman, C. Lindegaard, and P. H. Andersen. An equine pain face. Veterinary anesthesia and analgesia, 42(1):103–114, 2015

Motivation

• Automatic pain detection can have a huge impact on animal welfare
  – $0.5 billion spent annually in the US to treat lameness in horses
• Human detection of animal pain is challenging
  – Humans need to be trained to detect pain
  – Livestock may hide expressions of pain near humans

Key idea

• Leverage existing human training data for animal facial keypoint detection (with a small amount of animal training data)
• Warp animal faces to resemble human shape

[Maheen Rashid, Xiuye Gu, Yong Jae Lee, "Interspecies Knowledge Transfer", CVPR 2017]

Approach

• Warping network warps the animal face to have a human-like shape
• Keypoint network predicts facial keypoints on the warped animal face
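A minimal sketch of the two-stage design, assuming the warping network predicts a simple affine transform (the actual method learns a richer warp) and the keypoint network regresses the five keypoints on the warped face; layer sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpThenDetect(nn.Module):
    """Warp an animal face toward a human-like shape, then predict keypoints."""

    def __init__(self, num_keypoints=5):
        super().__init__()
        self.num_kp = num_keypoints
        # Warping network: predicts an affine transform (a stand-in for the
        # learned warp used in the paper).
        self.warp_net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 6),
        )
        # Initialize to the identity warp.
        self.warp_net[-1].weight.data.zero_()
        self.warp_net[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Keypoint network: regresses (x, y) for each keypoint on the warped face.
        self.keypoint_net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_keypoints * 2),
        )

    def forward(self, animal_face):
        theta = self.warp_net(animal_face).view(-1, 2, 3)
        grid = F.affine_grid(theta, animal_face.size(), align_corners=False)
        warped = F.grid_sample(animal_face, grid, align_corners=False)
        keypoints = self.keypoint_net(warped).view(-1, self.num_kp, 2)
        return warped, keypoints
```

Because the keypoint network operates on human-like face shapes, it can be initialized from a model trained on the much larger human facial keypoint data and then fine-tuned with the small animal set.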

Experiments

• Horse Facial Keypoint Dataset
  – 3,531 training, 186 testing images
  – Left eye, right eye, nose center, left mouth corner, right mouth corner
• Sheep Facial Keypoint Dataset [Yang et al. 2016]
  – 432 training, 99 testing images
  – Same 5 keypoints as above

Qualitative results

(Figure columns: ground truth, our warp, our prediction, no warp)

Qualitative results: ours vs. Yang et al. 2016

• By leveraging existing human training data, our CNN model performs much better

Quantitative results

• Baseline: Triplet Interpolated Features [Yang et al. 2016]

(Results table: keypoint detection accuracy on the Horse and Sheep datasets)

Conclusions

• Weakly-supervised learning enables scalable visual recognition
  - Object detection, attribute recognition, visual grounding
• Still, challenges remain in dealing with
  - Noise in data
  - Limited training data

Coming up

• Sign-up for paper
• Next class: CNN/PyTorch/Torch tutorial
