Learning Object Detectors From Weakly Supervised Image Data

One of the fundamental challenges in automatically detecting and localizing objects in images is the need to collect a large number of example images with annotated object locations (bounding boxes). The introduction of detection challenge datasets has propelled progress by providing the research community with enough fully annotated images to train competitive detectors for 20-200 classes. However, as we look towards the goal of scaling our systems to human-level category detection, it becomes impractical to collect a large quantity of bounding box labels for tens or even hundreds of thousands of categories. In this talk I will discuss recent work on enabling the training of detectors with weakly annotated images, i.e., images that are known to contain the object but whose object location (bounding box) is unknown.

The first approach I will present is a new multiple instance learning (MIL) method for object detection that is capable of handling noisy, automatically obtained annotations. Our approach consists of first obtaining confidence estimates over the label space and then incorporating these estimates within a new Boosting procedure. We demonstrate the efficiency of our procedure on two detection tasks, namely horse detection and pedestrian detection, where the training data is primarily annotated by a coarse area-of-interest detector, and show substantial improvements over existing MIL methods.

I will also present a second, complementary approach: a domain adaptation algorithm that learns the difference between the classification task and the detection task, and transfers this knowledge to classifiers for categories without bounding box annotated data, adapting them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations yet have plenty of classification data in ImageNet. The approach is evaluated on the ImageNet LSVRC-2013 detection challenge.
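
As a rough illustration of the second approach described above, the idea of "learning the difference" between classifiers and detectors can be sketched in a few lines of Python. This is a minimal sketch under strong assumptions, not the method evaluated in the talk: each category is assumed to be a plain weight vector, the difference is assumed to be a simple average offset over the categories that have both a classifier and a detector, and the names `mean_offset` and `adapt_classifier` as well as all numbers are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_offset(cls_w: Dict[str, Vector], det_w: Dict[str, Vector]) -> Vector:
    """Average (detector - classifier) difference over categories that have both."""
    shared = sorted(set(cls_w) & set(det_w))
    if not shared:
        raise ValueError("need at least one category with both weight vectors")
    dim = len(cls_w[shared[0]])
    offset = [0.0] * dim
    for name in shared:
        for d in range(dim):
            offset[d] += (det_w[name][d] - cls_w[name][d]) / len(shared)
    return offset


def adapt_classifier(cls_vec: Vector, offset: Vector) -> Vector:
    """Turn a classifier into a pseudo-detector by adding the learned offset."""
    return [w + o for w, o in zip(cls_vec, offset)]


# Hypothetical toy weights: dog and apple have detectors, cat does not.
cls_w = {"dog": [1.0, 0.0], "apple": [0.0, 1.0], "cat": [0.8, 0.2]}
det_w = {"dog": [1.5, -0.5], "apple": [0.5, 0.5]}

offset = mean_offset(cls_w, det_w)
cat_det = adapt_classifier(cls_w["cat"], offset)
print(offset, cat_det)
```

Here the categories with bounding box data (dog, apple) supply the offset, and the category without it (cat) only contributes its classifier weights.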

Text of Learning Object Detectors From Weakly Supervised Image Data

  • Kate Saenko, University of Massachusetts, Lowell
  • COMPUTER VISION: LEARNING TO DETECT OBJECTS. Kate Saenko, University of Massachusetts, Lowell
  • What is computer vision?
  • Computer Vision: we're not quite there yet, but... (Terminator 2, Enemy of the State; from UCSD Fact or Fiction DVD)
  • Machine Learning: What is it? Program a computer to learn from experience; learn from big data
  • Machine Learning in practice
  • Machine learning is not perfect
  • Machine learning is not perfect
  • Personal photo albums: lots of image data available!
  • Data for computer vision
  • What are applications of computer vision?
  • Computer Vision: Surveillance and Security
  • Smart cars: Mobileye vision systems currently in high-end BMW, GM, Volvo models; by 2010: 70% of car manufacturers (slide content courtesy of Amnon Shashua)
  • Scientific Images
  • Medical Imaging: image-guided surgery (Grimson et al., MIT); 3D imaging (MRI, CT). Slide by S. Seitz
  • Vision for Robotics: http://www.robocup.org/; NASA's Mars Spirit Rover, http://en.wikipedia.org/wiki/Spirit_rover. Slide by S. Seitz
  • Object Detection: Face Detection. Viola and Jones, Rapid object detection using a boosted cascade of simple features, CVPR 2001
  • What is object detection?
  • Goal of object detection. Detect: PERSON
  • Why is object detection difficult?
  • Why is object detection difficult? Can you detect all objects in this image?
  • Easy to collect data on the web!
  • Difficult to label image annotations. Image-level labels (dog, apple): easy to label from a search engine. Bounding box labels (dog, apple): much more difficult and costly to label
  • Goal of this research: learn from weakly labeled data!
  • How well can we do without bounding box labels? Computer detecting pedestrians
  • How well can we do without bounding box labels? Computer detecting 7,000 object categories
  • Confidence-rated Multiple Instance Boosting for Detection. Joint work with Karim Ali
  • Motivation. Object detection: high accuracy requires large labeled data sets. Scalability: reducing annotation requirements (semi-supervised learning, active learning, multiple-instance learning)
  • Overview: CR-MILBOOST
  • Multiple instance learning with noise: MI learning cannot handle noisy bags
  • Outline. Reminder: What is MIL?; CR-MILBoost (CVPR14); Conclusion & Future Work; Discussion
  • Reminder: What is MIL? Supervised learning: each instance has an associated label. MIL (weaker supervision): examples come in bags, and each bag has a label. Negative bag: all instances in the bag are negative. Positive bag: at least one instance in the bag is positive
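  • Supervised vs MIL (binary). Supervised learning: $(x_i, y_i) \in \mathbb{R}^D \times \{-1, 1\}$; $\varphi(x) > 0$ if $y = +1$, $\varphi(x) < 0$ if $y = -1$; $\varphi^*(x) = \arg\min_{\varphi(x)} L(\varphi; x, y)$. MI learning: $(X_i = \{x_{i1}, \ldots, x_{iK}\}, y_i) \in (\mathbb{R}^D)^K \times \{-1, 1\}$; $\max_j \varphi(x_{ij}) > 0$ if $y_i = +1$, $\max_j \varphi(x_{ij}) < 0$ if $y_i = -1$; $\varphi^*(x) = \arg\min_{\varphi(x)} L(\varphi; X, y)$
  • Related Methods: how to estimate latent labels for positives (spectrum from Supervised to MIL). Gartner, ICML02: $X_i = \frac{1}{N} \sum_j x_{ij}$. Xu, ICML04: $\varphi(X_i) = \frac{1}{N} \sum_j \varphi(x_{ij})$. Andrews, NIPS03: $\varphi(X_i) = \max_j \varphi(x_{ij})$. Bunescu, ICML07: SVM constraints. Viola, NIPS07: $p_i = 1 - \prod_j (1 - p_{ij})$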
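  • CR-MILBOOST. MILBoost: $\varphi^*(x) = \arg\min_{\varphi} -\log \prod_i p_i^{t_i} (1 - p_i)^{1 - t_i}$, with $p_{ij} = \frac{1}{1 + e^{-\varphi(x_{ij})}}$ and $p_i = 1 - \prod_j (1 - p_{ij})$
  • CR-MILBOOST. MILBoost: $\varphi^*(x) = \arg\min_{\varphi} -\log \prod_i p_i^{t_i} (1 - p_i)^{1 - t_i}$; instance weights $w_{ij} = \frac{y_i - p_i}{p_i} p_{ij}$; strong classifier $\varphi(x) = \sum_k \alpha_k h_k(x)$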
  • CR-MILBOOST: two-step procedure. (1) Estimate probabilities on the latent labels. (2) Integrate the estimates in a new loss. Mitigates label estimation error by incorporating priors
  • CR-MILBOOST, Step 1: $Q = \{\varphi_1(x), \varphi_2(x), \ldots, \varphi_q(x)\}$; $h_{ij} = P(y_{ij} = y_i \mid Q) = \frac{1}{1 + e^{-y_i \varphi_q(x_{ij})}}$; $h_i = P(y_i \mid Q) = \max_j h_{ij}$
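  • CR-MILBOOST, Step 2: the confidence estimates $h_{ij}$ and $h_i$ are integrated into the MILBoost loss $\varphi^*(x) = \arg\min_{\varphi} -\log \prod_i p_i^{t_i} (1 - p_i)^{1 - t_i}$, where $p_{ij} = \frac{1}{1 + e^{-\varphi(x_{ij})}}$ and $p_i = 1 - \prod_j (1 - p_{ij})$
  • CR-MILBOOST, Step 2: the instance weights $w_{ij}$ become a function of $y_i$, $p_i$, $h_i$, $h_{ij}$ and $p_{ij}$; strong classifier $\varphi(x) = \sum_k \alpha_k h_k(x)$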
  • Experiments: Features. $h^*_{e,R}(x) = \frac{\sum_{m \in R} \xi_e(x, m)}{\sum_{d \in \Phi,\, m \in R} \xi_d(x, m)}$. Weak learners $(e, R, \tau)$: an edge orientation, a sub-window, a threshold. Simple, efficient. Q=4, number of stumps. $\varphi(x) = \sum_k \alpha_k h_k(x)$
  • Experiments: Pedestrian Detection. Training data: 200 images automatically downloaded from the web; 200 objectness bounding boxes
  • Experiments: Pedestrian Detection. Testing data: INRIA Person, 300 images containing 600 pedestrians
  • Experiments: Pedestrian Detection
  • Experiments: Pedestrian Detection
  • Experiments: Pedestrian Detection
  • Experiments: Horse Detection. Training data: 200 images automatically downloaded from the web; 200 objectness bounding boxes
  • Experiments: Horse Detection. Testing data: 200 images containing 200 side-view horses
  • Experiments: Horse Detection
  • Experiments: Horse Detection
  • Experiments: Horse Detection
  • Conclusion. New MIL method: CR-MILBOOST; two-step procedure; dramatic increase in performance (200% on two datasets); quality of selected examples still suffers from additional ambiguity when compared to the fully supervised examples
  • Adapting Deep CNNs from Classification to Detection. Joint work with Judy Hoffman, Eric Tzeng, Sergio Guadarrama and Trevor Darrell at UC Berkeley
  • Recall: classification is easier than detection. Classification label (dog, apple): easy to label. Detection label (dog, apple): much more difficult and costly!
  • Main idea behind the approach (diagram). Classification images $I_{CLASSIFY}$ for dog, apple and cat give classifiers $W_{CLASSIFY}$ (dog, apple, cat). Detection images $I_{DET}$ for dog and apple give detectors $W_{DET}$ (dog, apple). For cat there is no $I_{DET}$, so $W_{DET}$ (cat) has to be obtained from $W_{CLASSIFY}$ (cat)
  • Deep Convolutional Neural Network (diagram). Train classification CNN (layers 1-5, fc6, fc7, fcA, fcB) on classification data from categories A and B. Example outputs: cat: 0.90, dog: 0.85, airplane: 0.05, person: 0.10
  • Adaptation pipeline (diagram). (a) Classification CNN: layers 1-5, fc6, fc7, fcA, fcB, trained on classification data from categories A and B. (b) Hidden Layer Adaptation: train adapted detection CNN (det layers 1-5, det fc6, det fc7, det fcB, background) on detection data from categories B (labeled warped regions). (c) Output Layer Adaptation: adapt fcA into det fcA. Final combined and fully adapted CNN: det layers 1-5, det fc6, det fc7, det fcA (cat: 0.90, airplane: 0.02), det fcB (dog: 0.45, person: 0.15), background: 0.25
  • Results on ILSVRC 2013 Detection
  • Results on ILSVRC 2013 Detection
  • Results on ILSVRC 2013 Detection
  • Preliminary results on 7K categories
  • Conclusion. Presented two new methods for object detector training with minimal bounding box annotation: (1) MIL-based method for learning from results of image search; (2) adaptation from classification to detection task
  • Questions?
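
The MILBoost quantities that appear on the CR-MILBOOST slides (instance probability, noisy-OR bag probability and instance weight) can be sketched in plain Python. This is a minimal illustration of standard MILBoost only, not of the confidence-rated variant presented in the talk. The function names are hypothetical, the bag label `t` is assumed to be 0 (negative bag) or 1 (positive bag), and scores are assumed to be moderate so that the bag probability never underflows to zero.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function: instance probability p_ij."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bag_probability(instance_probs: List[float]) -> float:
    """Noisy-OR: p_i = 1 - prod_j (1 - p_ij)."""
    prod = 1.0
    for p in instance_probs:
        prod *= 1.0 - p
    return 1.0 - prod


def milboost_weights(scores: List[float], t: int) -> List[float]:
    """Instance weights w_ij = (t - p_i) / p_i * p_ij for one bag."""
    p_inst = [sigmoid(s) for s in scores]
    p_bag = bag_probability(p_inst)  # assumed > 0 for moderate scores
    return [(t - p_bag) / p_bag * p for p in p_inst]


# Hypothetical bag with two instances, both scored 0 by the current classifier.
scores = [0.0, 0.0]
p_bag = bag_probability([sigmoid(s) for s in scores])
w_pos = milboost_weights(scores, t=1)  # bag labeled positive
w_neg = milboost_weights(scores, t=0)  # bag labeled negative
print(p_bag, w_pos, w_neg)
```

Instances in a positive bag receive positive weights that shrink as the bag probability approaches 1, while every instance in a negative bag receives a negative weight equal to minus its own probability.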