
OPTIMOL: automatic Online Picture collecTion via Incremental MOdel Learning

Li-Jia Li1, Gang Wang1 and Li Fei-Fei2

1 Dept. of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, USA
2 Dept. of Computer Science, Princeton University, USA

[email protected], [email protected], [email protected]

Abstract

A well-built dataset is a necessary starting point for advanced computer vision research. It plays a crucial role in evaluation and provides a continuous challenge to state-of-the-art algorithms. Dataset collection is, however, a tedious and time-consuming task. This paper presents a novel automatic dataset collecting and model learning approach that uses object recognition techniques in an incremental method. The goal of this work is to use the tremendous resources of the web to learn robust object category models in order to detect and search for objects in real-world cluttered scenes. It mimics the human learning process of iteratively accumulating model knowledge and image examples. We adapt a non-parametric graphical model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, we offer not only more images in each object category dataset, but also a robust object model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to Caltech 101 and LabelMe.

1. Introduction

Type the word "airplane" in your favorite Internet search image engine, say Google Image (or Yahoo!, flickr.com, etc.). What do you get? Of the thousands of images these search engines return, only a small fraction would be considered good airplane images (∼ 15%). It is fair to say that for most of today's average users surfing the web for images of generic objects, the current commercial state-of-the-art results are far from satisfying.

This problem is intimately related to the problem of learning and modeling generic object classes, a topic that has recently captured the attention of search engine developers as well as vision researchers. However, in order to develop effective object categorization algorithms, researchers rely on a critical resource: an accurate object class dataset. A good dataset serves as training data as well as an evaluation benchmark. A handful of large-scale datasets exist currently to serve such a purpose, such as Caltech 101 [4], the UIUC car dataset [1], etc. Sec. 1.1 will elaborate on the strengths and weaknesses of these datasets. In short, all of them have rather a limited number of images and offer no possibility of expansion other than with extremely costly manual labor.

Figure 1. Illustration of the framework of the Online Picture collecTion via Incremental MOdel Learning (OPTIMOL) system. This framework works in an incremental way: once a model is learned, it can be used to classify images from the web resource. Images classified as belonging to the object category are incorporated into the collected dataset; otherwise, they are discarded. The model is then updated by the newly accepted images in the current iteration. In this incremental fashion, the category model gets more and more robust. As a consequence, the collected dataset becomes larger and larger.

So far the story is a frustrating one: Users of the web search engines would like better search results when looking for, say, objects; developers of these search engines would like more robust visual models to improve these results; vision researchers are developing the models for this purpose; but in order to do so, it is critical to have large and diverse object datasets for training and evaluation; this, however, goes back to the same problem that the users face.

In this paper, we provide a framework to simultaneously learn object class models and collect object class datasets. This is achieved by leveraging the vast resource of images available on the Internet. The sketch of our idea is the following. Given a very small number of seed images of an object class (either provided by a human or automatically), our algorithm learns a model that best describes this class. Serving as a classifier, the algorithm can pull from the web those images that belong to the object class. The newly collected images are added to the object dataset, serving as new training data to update and improve the object model. With this new model, the algorithm can then go back to the web and pull more relevant images. This is an iterative process that continuously gathers a highly accurate image dataset while learning a more and more robust object model. We will show in our experiments that our automatic, online algorithm is capable of collecting object class datasets that are far bigger than Caltech 101 or LabelMe [16]. To summarize, we highlight here the main contributions of our work.

• We propose an iterative framework that simultaneously collects object category datasets and learns the object category models. The framework uses Bayesian incremental learning as its theoretical base. To the best of our knowledge, ours is among the first papers (if not the first) to deal with these two problems together.

• We have developed an incremental learning scheme that uses only the newly added data points (i.e. images) for training a new model. This memory-less learning scheme is capable of handling an arbitrarily large number of images, a vital property for large image datasets.

• Our experiments show that our algorithm is capable of both learning highly effective object category models and collecting object category datasets far larger than those of Caltech 101 or LabelMe.

1.1. Related works

Image Retrieval from the Web: Content-based image retrieval (CBIR) has been an active field of research for a number of years. However, we do not regard our work as fitting the conventional framework of CBIR. Instead of learning to annotate images with a list of words and phrases, we emphasize collecting the most suitable images possible from web resources given a single word or phrase. One major difference between our work and traditional CBIR is the emphasis on visual model learning. While collecting images of a particular object category, our algorithm continues to learn a better and better visual model to classify this object.

A few recent systems in this domain are closer to our current framework. H. Feng et al. proposed a method to refine search engine returns by co-training [7]. Their method, however, does not offer an incremental training framework to iteratively improve the collection. Berg and Forsyth developed a system to collect animal pictures from the web [2]. Their system takes advantage of both the text surrounding the web images and the global feature statistics (patches, colors, textures) of the images to collect a large number of animal images. Another method close in spirit to ours is by Yanai and Barnard [21]. Though their method focuses on image annotation, they also utilize the idea of refining web image returns with a probabilistic model. Finally, two papers by Fergus et al. [8, 10] use the idea of training a good object class model from web images returned by search engines, hence obtaining an object filter to refine these results. All the techniques above achieve better search results by using either a better visual model or a combination of visual and text models to essentially re-rank the rather noisy images from the web. We show later that by introducing an iterative framework of incremental learning, we are able to embed the processes of image collection and model learning into a mutually reinforcing system.

Object Classification: Given the recent explosion of object categorization research, it is beyond the scope of this paper to offer a thorough review of the literature. We would like to emphasize that our proposed framework is not limited to the particular object model used in this paper as an example: any model that can be cast into an incremental learning framework is suitable for our protocol. Of the many possibilities, we have chosen to use a variant of the HDP (Hierarchical Dirichlet Process) [19] model based on the "bag of words" representation of images. A number of systems based on the bag-of-words representation have been shown to be effective for object and scene classification [8, 17, 6]. The models mentioned above are all developed for a batch learning scenario. A handful of object recognition works have also dealt with the issue of incremental learning explicitly. The most notable ones are [13] and [4]. Our approach, however, is based on a model significantly different from these papers.

Object Datasets: One main goal of our proposed work is to suggest a framework that can replace much of the current human effort in image dataset collection for object categorization. A few popular object datasets exist today as the main training and evaluation resources for the community. Caltech 101 contains 101 object classes, each containing between 40 and 400 images [4]. It was collected by a group of students spending on average three or four hours per 100 images. While it is regarded as one of the most comprehensive object category datasets now available, it is limited in terms of the variation in the images (big, centered objects with few viewpoint changes), the number of images per category (at most a few hundred), as well as the number of categories. Recently, LabelMe has offered an alternative way of collecting datasets of objects by having people upload their images and label them [16]. This dataset is much more varied than Caltech 101, potentially serving as a better benchmark for object detection algorithms. But since it relies on people uploading pictures and making uncontrolled annotations, it is difficult to use it as a generic object dataset. In addition, while some classes have many images (such as 8897 for "car"), others have very few (such as 6 for "airplane"). A few other object category datasets such as [1] are also used by researchers. All of the datasets mentioned above require laborious human effort to gather and select the images. In addition, while serving as training and test datasets for researchers, they are not suitable or usable for general search engine users. Our proposed work offers a unified way of automatically gathering data useful both as a research dataset as well as for answering user queries.

2. General Framework of OPTIMOL

Algorithm 1 Incremental learning, classification and data collection

  Download from the Web a large reservoir of images obtained by searching for keyword(s)
  Initialize the object category dataset with seed images (manually or automatically)
  repeat
    Learn object category model with the latest input images to the dataset
    Classify downloaded images using the current object category model
    Augment the dataset with accepted images
  until user satisfied or images exhausted
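The loop of Alg. 1 can be sketched in a few lines. The following is a minimal illustrative skeleton, not the authors' implementation: the `learn` and `classify` callbacks are hypothetical stand-ins for the HDP model updates and likelihood-ratio classification described in Sec. 3, and the toy instantiation treats "images" as scalars and the "model" as a running mean.

```python
def optimol_loop(web_images, seeds, learn, classify, batch_size=4):
    """Sketch of OPTIMOL's outer loop (Alg. 1): learn, classify, augment, repeat.

    `learn` and `classify` are hypothetical stand-ins for the HDP learning
    and the likelihood-ratio classifier of Sec. 3.
    """
    dataset = list(seeds)                   # collected dataset, grown each round
    model = learn(None, seeds)              # initial model from the seed images
    while web_images:                       # "until images exhausted"
        batch, web_images = web_images[:batch_size], web_images[batch_size:]
        accepted = [im for im in batch if classify(model, im)]
        dataset.extend(accepted)            # augment the dataset
        if accepted:                        # incremental: update with NEW images only
            model = learn(model, accepted)
    return dataset, model

# Toy instantiation: "images" are scalars, the "model" is a running mean,
# and classification accepts images close to that mean.
def learn(model, images):
    mean = sum(images) / len(images)
    return mean if model is None else (model + mean) / 2

classify = lambda model, im: abs(im - model) < 2.0
dataset, model = optimol_loop([1.2, 9.0, 0.8, 1.5], seeds=[1.0, 1.1],
                              learn=learn, classify=classify)
```

Note how the model update touches only the newly accepted images, mirroring the memory-less learning scheme highlighted in the contributions.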

OPTIMOL has two goals to fulfill simultaneously: to automatically collect datasets of object classes from the web and to incrementally learn object category models. We use Fig. 1 and Alg. 1 to illustrate the overall framework. For every object category we are interested in, say, "panda", we initialize our image dataset with a handful of seed images. This can be done either manually or automatically¹. With this small dataset, we begin the iterative process of model learning and dataset collection. Learning is done via an incremental learning procedure we introduce in Sec. 3.3. Given the most updated model of the object class, we perform a binary classification on a subset of images downloaded from the web (e.g. panda vs. background)². If an image is accepted as a "panda" image based on some statistical criteria (see Sec. 3.3), we augment our existing panda dataset by appending this new image. We then update our panda model with the subset of the newly accepted images (see Sec. 3.4 for details of the "cache set"). Note that the already existing images in the dataset no longer participate in this round of learning. Meanwhile, the background model will also be updated using a constant resource of background images. We repeat this process till a sufficient dataset is collected or we have exhausted all downloaded images.

¹ To automatically collect a handful of seed images, we can use the images returned by the first page of Google image search, or any other state-of-the-art commercial search engine.

² The background class model is learnt by using published 'background' image datasets [9, 5].

Figure 2. Graphical model of HDP (nodes: H, γ, β, α, π, z, θ, x). Each node denotes a random variable.

3. Detailed System of OPTIMOL

3.1. Object category model

We choose the "bag of words" representation for our object category model, but our system is not committed to any particular choice of object category representation or model structure. As long as one could cast the model into an incremental learning framework, it would be theoretically suitable for OPTIMOL. We make this choice, however, based on the recent successes of the "bag of words" models in object and scene recognition [17, 6], particularly the usage of latent topic models for such representation [11, 3, 19]. Similarly to [18, 20], we adapt the Hierarchical Dirichlet Process (HDP) [19] for our object category model. Compared to parametric latent topic models such as LDA [3] or pLSA [11], HDP offers a way to sample an infinite number of latent topics, or clusters, for each object category model. This property is especially desirable for OPTIMOL because as we grow our dataset, we would like to retain the ability to 'grow' the object class model when new clusters of images arise. We introduce our HDP object category model in more detail in Sec. 3.2.

3.2. Hierarchical Dirichlet process

We represent an image as a bag of visual words. Each category consists of a variable number of latent topics corresponding to clusters of images that have similar visual-word attributes. We model both object and background classes with HDP [19]. Given γ, α as the concentration parameters and H as a base probability measure, HDP defines a global random probability measure G_0 ∼ DP(γ, H). Based on G_0, a random measure G_j ∼ DP(α, G_0) is independently sampled for each group to explain the internal structure. Here, DP represents the Dirichlet process.


Fig. 2 shows the graphical model of HDP. θ corresponds to the distributions of visual words given different latent topics shared among different images. H indicates the prior distribution of θ. Let x_ji be the ith patch in the jth image. For each patch x_ji, there is a hidden variable z_ji denoting the latent topic index. If β is the stick-breaking weights and π_j is the mixing proportion of z for the jth image, the hierarchical Dirichlet process can be expressed as:

    β | γ ∼ GEM(γ)        π_j | α, β ∼ DP(α, β)
    θ_k | H ∼ H        z_ji | π_j ∼ π_j        x_ji | z_ji, θ_k ∼ F(θ_{z_ji})        (1)

where GEM represents the stick-breaking process. For both Fig. 2 and Eq. 1, we omit the mention of object category to avoid confusion. The distribution of π is class specific, and so is x. In other words, two class-specific θs govern the distribution of x. Given an object class c, a topic z can be generated using a multinomial distribution parameterized by π. Given this topic, a patch is generated using the multinomial distribution F(θ^c_z).
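The GEM (stick-breaking) prior in Eq. 1 can be illustrated with a truncated sketch. The truncation level and the renormalization of the leftover stick mass are our simplifications; the actual model allows an unbounded number of topics:

```python
import numpy as np

def stick_breaking(gamma, num_topics, rng):
    """Truncated stick-breaking draw of beta ~ GEM(gamma).

    Illustrative only: we truncate at `num_topics` topics and fold the
    leftover stick mass into the last weight so the weights sum to 1.
    """
    v = rng.beta(1.0, gamma, size=num_topics)             # stick proportions v_k
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = v * remaining                                  # beta_k = v_k * prod_{l<k}(1 - v_l)
    beta[-1] += 1.0 - beta.sum()                          # fold leftover stick mass
    return beta

beta = stick_breaking(gamma=1.0, num_topics=20, rng=np.random.default_rng(0))
```

Smaller γ concentrates the mass on the first few topics; larger γ spreads it out, which is how the model can "grow" new clusters as the dataset grows.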

3.3. Incremental learning of a latent topic model

Given the object class model, we propose an incremental learning scheme such that OPTIMOL can update the model at every iteration of the dataset collection process. Our goal here is to perform incremental learning using only the new images selected at the given iteration. We will illustrate in Fig. 7(b) that this is much more efficient than performing batch learning with all images in the existing dataset at every iteration. Let Θ denote the model parameters, and I_j denote the jth image, represented by a set of patches x_j1, ..., x_jn. For each patch x_ji, there is a hidden variable z_ji denoting the latent topic index. The model parameters and hidden variables are updated iteratively using the current model and the input image I_j in the following fashion:

    z_j ∼ p(z | Θ_{j−1}, I_j)        Θ_j ∼ p(Θ | z_j, Θ_{j−1}, I_j)        (2)

where Θ_{j−1} represents the model parameters learned from the previous j−1 images. Neal & Hinton [15] provide a theoretical ground for incrementally learning mixture models via sufficient-statistics updates. We follow this idea by keeping only the sufficient statistics of the parameters associated with the existing images in an object dataset. Learning is then achieved by updating these sufficient statistics with those provided by the new images. One straightforward method is to use all the new images accepted by the current classification criterion. It turns out that this method favors too heavily those images with an appearance similar to the existing ones, hence resulting in more and more specialized object models. In order to take full advantage of the non-parametric HDP model, as well as to avoid this "over-specialization", we only use a subset of images to update our model. This subset is called the "cache set". We detail the selection of the "cache set" in Sec. 3.4.
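The sufficient-statistics idea of [15] can be illustrated with a toy multinomial topic: only running counts are stored, so each update touches just the new image's patches and old images are never revisited. This is a hypothetical simplification, not the paper's HDP machinery:

```python
from collections import Counter

class IncrementalTopic:
    """Toy sufficient-statistics update for one multinomial topic.

    Illustrative only: the paper updates HDP statistics within Gibbs
    sampling; here we keep the sufficient statistics of a multinomial
    (codeword counts), so an update touches only the NEW image's patches.
    """
    def __init__(self, vocab_size, prior=1.0):
        self.counts = Counter()       # sufficient statistics: codeword counts
        self.total = 0
        self.vocab_size = vocab_size
        self.prior = prior            # symmetric Dirichlet smoothing (assumed)

    def update(self, patches):        # patches: codeword ids of one new image
        self.counts.update(patches)
        self.total += len(patches)

    def prob(self, w):                # smoothed p(codeword w | topic)
        return (self.counts[w] + self.prior) / (self.total + self.prior * self.vocab_size)

topic = IncrementalTopic(vocab_size=500)
topic.update([3, 3, 7])               # first accepted image
topic.update([3, 9])                  # later image; earlier counts are untouched
```

Because only counts are kept, memory and per-iteration cost are independent of how many images have already been collected, which is the "memory-less" property the paper relies on.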

3.3.1 Markov Chain Monte Carlo sampling

The goal of learning is to update the parameters of the hierarchical model. In this section, we describe how we learn the parameters by Gibbs sampling [13] of the latent variables. We choose the popular Chinese restaurant franchise [19] metaphor to describe this procedure. Imagine multiple Chinese restaurants sharing a set of dishes. At each table of each restaurant, one dish is shared by the customers sitting at that table. Metaphorically, we describe the jth image as the jth restaurant and the image-level mixture component for x_ji as a table t_ji, where x_ji is the ith customer in the jth restaurant. Similarly, the global latent topic for the tth table in the jth restaurant is represented as the dish k_jt:

    t_ji | t_j1, ..., t_j,i−1, α, G_0 ∼ Σ_{t=1}^{T_j} n_jt δ(t_ji = t) + α G_0        (3)

    k_jt | k_11, k_12, ..., k_21, ..., k_j,t−1, γ ∼ Σ_{k=1}^{K} m_k δ(k_jt = k) + γ H        (4)

where n_jt denotes the number of customers sitting at the tth table in the jth restaurant, T_j is the current number of tables in the jth restaurant, m_k represents the number of tables that ordered dish k, and K denotes the current number of dishes. A new table and a new dish can also be generated from G_0 and H, respectively, when needed.

Sampling the table. According to Eq. 3 and Eq. 4, the probability of a new customer x_ji being assigned to table t is:

    P(t_ji = t | x_ji, t_{−ji}, k) ∝  α p_{t_new}                for t = t_new
                                      n_jt f(x_ji | θ_{k_jt})    for used t        (5)

where p_{t_new} is the likelihood for t_ji = t_new:

    p_{t_new} = Σ_{k=1}^{K} [ m_k / (Σ_{k=1}^{K} m_k + γ) ] f(x_ji | θ_k) + [ γ / (Σ_{k=1}^{K} m_k + γ) ] f(x_ji | θ_{k_new})

Here f(x_ji | θ_k) is the conditional density of patch x_ji given all data items associated with global latent topic k except itself. The probability of assigning a newly generated table t_new to a global latent topic is proportional to:

    m_k f(x_ji | θ_k)            for used k
    γ f(x_ji | θ_{k_new})        for new k        (6)

Sampling the global latent topic. For an existing table, the dish can change according to all the customers sitting at that table. A sample of the global latent topic for the image-level mixture component t in image j, k_jt, can be obtained from:

    m_k f(x_jt | θ_k)            for used k
    γ f(x_jt | θ_{k_new})        for new k        (7)

Similarly, f(x_jt | θ_k) is the conditional density of the set of patches x_jt given all patches associated with topic k except themselves. A new global latent topic is sampled from H if k = k_new according to Eq. 7. n_jt and m_k are then updated according to the table index and global latent topic assigned. Given z_ji = k_{j,t_ji}, we in turn update F(θ^c_{z_ji}).
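The table-assignment step of Eq. 5 is a draw from a discrete distribution over occupied tables plus one "new table" option. A sketch, where `tables`, `f` and `p_tnew` are our stand-ins for the paper's Chinese-restaurant-franchise bookkeeping:

```python
import random

def sample_table(x, tables, alpha, f, p_tnew, rng):
    """Sample a table assignment for one patch x, following Eq. 5 (sketch only).

    `tables` is a list of (n_jt, theta) pairs for occupied tables; `f(x, theta)`
    is the patch likelihood and `p_tnew` the marginal likelihood of seating x
    at a new table. All names are stand-ins, not the paper's implementation.
    """
    weights = [n * f(x, theta) for n, theta in tables]  # used t: n_jt * f(x|theta)
    weights.append(alpha * p_tnew)                      # new t:  alpha * p_tnew
    r = rng.random() * sum(weights)
    for idx, w in enumerate(weights):
        r -= w
        if r <= 0:
            return idx          # idx == len(tables) means "open a new table"
    return len(weights) - 1

# Two occupied tables; the crowded, well-fitting table dominates the draw.
t = sample_table(0, [(5, 0.5), (1, 0.1)], alpha=1.0,
                 f=lambda x, theta: theta, p_tnew=0.01,
                 rng=random.Random(0))
```

The "rich get richer" behavior is visible in the weights: popular tables (large n_jt) attract new customers, while the α term keeps a nonzero chance of opening a new table, which is how new topics appear.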


3.4. New Image Classification and Annotation

For every iteration of the dataset collection process, we have a binary classification problem: classify images with the foreground object versus background images. Given the current model, we have p(z|c), parameterized by the distribution of global latent topics for each class in the Chinese restaurant franchise, and p(x|z, c), parameterized by F(θ^c_z) learned for each category c by Gibbs sampling. A testing image I is represented as a collection of local patches x_i, where i = {1, ..., M} and M is the number of patches. The likelihood p(I|c) for each class is calculated by:

    P(I | c) = Π_i Σ_j P(x_i | z_j, c) P(z_j | c)        (8)
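Eq. 8 is a product over patches of a mixture over topics; in practice it is computed in log space for numerical stability. A sketch, where the two lookup tables are stand-ins for the quantities the paper estimates by Gibbs sampling:

```python
import math

def image_log_likelihood(patch_ids, p_topic_given_class, p_word_given_topic):
    """Log of Eq. 8: log P(I|c) = sum_i log sum_j P(x_i|z_j,c) P(z_j|c).

    Stand-in tables (our assumption, not the paper's data structures):
    p_topic_given_class[j] = P(z_j|c), p_word_given_topic[j][w] = P(w|z_j,c).
    """
    total = 0.0
    for w in patch_ids:
        p = sum(p_word_given_topic[z][w] * pz
                for z, pz in enumerate(p_topic_given_class))
        total += math.log(p)
    return total

# Two topics, two codewords; an image containing patches {0, 1}.
ll = image_log_likelihood([0, 1], [0.5, 0.5], [[0.2, 0.8], [0.6, 0.4]])
```

Computing one such log-likelihood per class model also yields the per-class scores used for classification below.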

Classification is made by choosing the category model that yields the higher probability. From a dataset-collection point of view, incorporating an incorrect image into the dataset (false positive) is much worse than missing a correct image (false negative). Hence, a risk function is introduced to penalize false positives more heavily:

    R(A | I) = λ_{A,cf} P(cf | I) + λ_{A,cb} P(cb | I)
    R(R | I) = λ_{R,cf} P(cf | I) + λ_{R,cb} P(cb | I)        (9)

Here A represents acceptance of an image into our dataset, R rejection. As long as the risk of accepting this image is lower than that of rejecting it, it gets accepted. Updating the training set is finally decided by the likelihood ratio:

    P(I | cf) / P(I | cb) > [ (λ_{A,cb} − λ_{R,cb}) / (λ_{R,cf} − λ_{A,cf}) ] · P(cb) / P(cf)        (10)

where cf is the foreground category and cb is the background category. The threshold (λ_{A,cb} − λ_{R,cb}) / (λ_{R,cf} − λ_{A,cf}) is automatically adjusted by applying the likelihood-ratio measurement to a reference dataset³ at every iteration. New images satisfying Eq. 10 are incorporated into the collected dataset.

The goal of OPTIMOL is not only to collect a good image dataset, but also to provide further information about the location and size of the objects contained in the dataset images. Object annotation is carried out by first calculating the likelihood of each patch x given the object class cf:

    p(x | cf) = Σ_i p(x | z_i, cf) p(z_i | cf)        (11)

The region with the most concentrated high-likelihood patches is then selected as the object region. Sample results are shown in Fig. 5.

As we mentioned in Sec. 3.3, we use a "cache set" of images to incrementally update our model. The "cache set" is a less "permanent" set of good images compared to the actual image dataset. At every round, if all "good" images were used for model learning, it is highly likely that many of these images would look very similar to the previously collected images, hence reinforcing the model to become even more specialized in picking out such images in the next round. The purpose of the "cache set" is therefore to retain a group of images that tend to be more diverse than the existing images in the dataset. Each new image passing the classification qualification (Eq. 10) is further evaluated by Eq. 12 to determine whether it should belong to the "cache set" or the permanent set.

    H(I) = −Σ_z p(z | I) ln p(z | I)        (12)

According to Shannon's definition of entropy, Eq. 12 measures the amount of uncertainty associated with a given probability distribution. Images with high entropy are more uncertain, indicating possible new topics or a lack of strong membership in one single topic. Thus, these high-likelihood, high-entropy images are good for model learning. Meanwhile, images with low entropy are regarded as confident foreground images, which are incorporated into the permanent dataset.

³ To achieve a fully automated system, we use the original seed images as the reference dataset. As the training dataset grows larger, the direct effect of the original training images vanishes in terms of the object model. It therefore becomes a good approximation of a validation dataset.
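The cache-versus-permanent routing driven by Eq. 12 can be sketched directly; the entropy threshold is a free parameter of this sketch, not a value given in the paper:

```python
import math

def route_image(topic_posterior, entropy_threshold=0.5):
    """Route an accepted image to the permanent dataset or the "cache set"
    using the entropy of its topic posterior p(z|I) (Eq. 12).

    The threshold value is an illustrative assumption of this sketch.
    """
    h = -sum(p * math.log(p) for p in topic_posterior if p > 0.0)
    return "cache" if h > entropy_threshold else "permanent"
```

An image confidently explained by a single topic (posterior near one-hot) has near-zero entropy and goes to the permanent set, while an ambiguous image with a flat posterior is held in the cache to diversify the next round of learning.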

4. Experiments & Results

We conduct 3 experiments to illustrate the effectiveness of OPTIMOL. Exp. 1 demonstrates the superior dataset collection results of OPTIMOL over the existing datasets. Exp. 2 shows that OPTIMOL is on par with state-of-the-art object models [8] for multiple object classification. Exp. 3 shows a performance comparison of the batch vs. incremental learning methods. We first introduce the various datasets, then show in Sec. 4.2 a walkthrough of how OPTIMOL works for the accordion category.

4.1. Dataset Definitions

We define the following 3 different datasets that we will use in our experiments:

1. Caltech 101-Web & Caltech 101-Human
Two versions of the Caltech 101 dataset are used in our experiments. Caltech 101-Web is the original raw dataset downloaded from the web, with a large portion of contaminated images in each category. The number of images in each category varies from 113 (winsor-chair) to 1701 (watch). Caltech 101-Human is the clean dataset manually selected from Caltech 101-Web. By using this dataset, we show that OPTIMOL can achieve comparable or even better retrieval performance than human-labeled results.

2. Web-23
We downloaded 21 object categories from online image search engines with corresponding query words randomly selected from the object categories in Caltech 101-Web. In addition, "face" and "penguin" categories are included in Web-23 for further comparison. The number of images in each category varies from 577 (stop-sign) to 12414 (face).


Most of the images in a category are incorrect images (e.g. 352 correct accordions out of 1659 images).

3. Fergus ICCV'05 dataset
A 7-category dataset provided by [8]. Object classes are: airplane, car, face, guitar, leopard, motorbike and watch.

4.2. Walkthrough for the accordion category

As an example, we describe how OPTIMOL collects images for the accordion category, following Alg. 1 and Fig. 1. We first download 1659 images by typing the query word "accordion" into image search engines such as Google image, Yahoo image and Picsearch. We use the first 15 images from the web resource as our seed images, assuming that most of them are good-quality accordions. We represent each image as a set of local regions. We use the Kadir and Brady [12] salient point detector to find informative local regions. A 128-dim rotationally invariant SIFT vector is used to represent each region [14]. We build a 500-word codebook by applying K-means clustering to the 89058 SIFT vectors extracted from the 15 seed images of each of the 23 categories. In Fig. 3, we show some detected interest regions as well as some codeword samples. Fig. 4 illustrates the first and second iterations of OPTIMOL for accordion category model learning, classification and image collection.

Figure 3. Left: interest regions found by the Kadir & Brady detector. The circles indicate the interest regions; the red crosses are the centers of these regions. Right: sample codewords. Patches with similar SIFT descriptors are clustered into the same codeword and are presented in the same color.
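The codebook-building step can be sketched with a toy Lloyd k-means over descriptor vectors. The initialization, dimensions and data below are our simplifications; the paper clusters 128-d SIFT vectors into a 500-word codebook:

```python
import numpy as np

def build_codebook(descriptors, k, iters=20):
    """Toy Lloyd k-means vector quantization for a visual-word codebook.

    `descriptors` is an (N, D) array of patch descriptors (128-d SIFT in the
    paper); returns a (k, D) array of codeword centers. A minimal sketch
    (naive init from the first k descriptors), not the authors' exact setup.
    """
    centers = descriptors[:k].astype(float).copy()
    for _ in range(iters):
        # assign each descriptor to its nearest codeword center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # recompute each center; keep the old one if its cluster empties
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers

rng = np.random.default_rng(1)
descs = np.vstack([rng.normal(0.0, 1.0, (50, 8)),   # one "visual pattern"
                   rng.normal(5.0, 1.0, (50, 8))])  # another, well separated
codebook = build_codebook(descs, k=2)
```

Each descriptor is then quantized to the index of its nearest codeword, turning an image into the "bag of words" used throughout Sec. 3.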

4.3. Exp.1: Image Collection

21 object categories are selected randomly from Caltech101-Web for this experiment. The experiment is split intotwo parts: 1. Retrieval from Caltech 101-Web. The num-ber of collected images in each category is compared withthe same numbers in Caltech 101-Human. 2. Retrieval fromWeb-23 using the same 21 categories as in part 1. Results ofthese two parts are displayed in Fig.5. We first observe thatOPTIMOL is capable of automatically collecting very simi-lar number of images from Caltech 101-Web as the humanshave done by hand in Caltech 101-Human. Furthermore, byusing images from Web-23, OPTIMOL achieves on average6 times as many images as Caltech 101-Human (some even10× higher). In Fig.5, we also compare our results withLabelMe [16] for each of the 22 categories. In addition,a “penguin” category is included so that we can compareour results with the state-of-art dataset collecting approach


Figure 4. Example of the first and second iterations of OPTIMOL. In the first iteration, a model is learned using the seed images. Classification is done on a subset of raw images collected from the web using the "accordion" query. Images with low likelihood ratios given by Eq. 10 are discarded. Of the remaining images, those with low entropies given by Eq. 12 are incorporated into the permanent dataset, while the high-entropy ones stay in the "cache set". In the second iteration, we update the model using only the images held in the "cache set" from the previous iteration. Classification is done on the new raw images as well as those in the cache. In the same way, we append the images with high likelihood ratio and low entropy to the permanent dataset and hold those with high likelihood ratio and high entropy in the "cache set" for the next iteration of model learning. This continues until the system comes to a halt.
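The bookkeeping in each iteration can be sketched in a few lines. The scores and threshold values below are hypothetical stand-ins for the likelihood ratio of Eq. 10 and the entropy of Eq. 12:

```python
# Hypothetical thresholds; in the paper these decisions come from Eq. 10
# (likelihood ratio) and Eq. 12 (entropy) of the learned category model.
RATIO_MIN = 1.0    # below this: discard as a non-category image
ENTROPY_MAX = 0.5  # at or below this: confident enough for the dataset

def triage(scored_images, ratio_min=RATIO_MIN, entropy_max=ENTROPY_MAX):
    """One OPTIMOL iteration's bookkeeping: split classified images into
    the permanent dataset, the "cache set", and the rejects."""
    accepted, cache, rejected = [], [], []
    for img, ratio, entropy in scored_images:
        if ratio < ratio_min:
            rejected.append(img)      # low likelihood ratio: discard
        elif entropy <= entropy_max:
            accepted.append(img)      # high ratio, low entropy: permanent dataset
        else:
            cache.append(img)         # high ratio, high entropy: learn from next round
    return accepted, cache, rejected

# (image, likelihood_ratio, entropy) triples with made-up scores.
scores = [("a.jpg", 3.2, 0.1), ("b.jpg", 0.4, 0.9),
          ("c.jpg", 2.0, 0.8), ("d.jpg", 5.0, 0.3)]
accepted, cache, rejected = triage(scores)
# accepted -> ["a.jpg", "d.jpg"], cache -> ["c.jpg"], rejected -> ["b.jpg"]
```

Only the cache set feeds the next model update, which is what keeps the incremental step cheap relative to retraining on everything.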

[2]. In all cases, OPTIMOL collected more positive images than Caltech 101-Human, the LabelMe dataset and the approach in [2], with very few mistakes. Note that all of these results are achieved without any human intervention, thus suggesting the viability of OPTIMOL as an alternative to costly human dataset collection.

4.4. Exp.2: Classification

To demonstrate that OPTIMOL not only collects large datasets of images but also learns good models for object classification, we conduct an experiment on the Fergus ICCV'05 dataset under the same settings as [8]. 7 object category models are learnt from the same training sets used by [8]. Similarly to [8], we use a validation set to train a 7-way SVM classifier to perform object classification. The feature vector of the SVM classifier has 7 entries, each denoting the image likelihood given one of the 7 class models. The results are shown in Fig. 6, where we achieve an average performance of 74.8%. This result is comparable to (slightly better than) the 72.0% achieved by [8]. Our results show that OPTIMOL is capable of learning reliable object models.
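A minimal sketch of this setup, with the paper's 7-way SVM replaced by a simple argmax over the 7-entry likelihood feature vector (the per-class scoring functions are toy stand-ins for the learned category models, not the actual OPTIMOL models):

```python
CLASSES = ["airplane", "car", "face", "guitar", "leopard", "motorbike", "watch"]

def feature_vector(image, models):
    """7-entry feature: the image's likelihood under each class model.
    `models` maps class name -> scoring function (hypothetical stand-ins)."""
    return [models[c](image) for c in CLASSES]

def classify(image, models):
    """Simplified stand-in for the paper's 7-way SVM over likelihood
    features: pick the class whose model likes the image most."""
    feats = feature_vector(image, models)
    return CLASSES[max(range(len(CLASSES)), key=feats.__getitem__)]

# Toy models: each scores an image by a filename keyword match
# (illustration only; real scores would be model likelihoods).
models = {c: (lambda img, name=c: 1.0 if name in img else 0.1) for c in CLASSES}
print(classify("guitar_001.jpg", models))  # -> guitar
```

In the actual experiment these 7-dim vectors are fed to an SVM trained on a validation set, which can weight and combine the per-class likelihoods rather than trusting a raw argmax.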

4.5. Exp.3: Comparison of incremental learning and batch learning

In this experiment, we compare both the computationtime and accuracy of the incremental learning and the batch


             airplane    car   face  guitar  leopard  motorbike  watch
airplane         76.0   14.0    0.3     5.3      0.3        0.3    4.8
car               1.0   94.5    0.3     4.5      0.3        0.3    0.3
face              0.5    1.4   82.9     3.7      0.5        0.5   11.5
guitar            2.2    4.9    5.6    60.4     13.3        0.2   13.3
leopard           1.0    2.0    1.0     5.0     89.0        1.0    2.0
motorbike         0.3    5.5    0.3     5.5      1.0       67.3   20.5
watch             1.7    5.5   17.7    11.0      5.5        5.0   53.6

Figure 6. Confusion table for Exp. 2 (rows: ground-truth classes; columns: predicted classes, in %). We use the same training and testing datasets as in [8]. The average performance of the OPTIMOL-trained classifier is 74.82%, whereas [8] reports 72.0%.

learning (Fig. 7). Due to the space limit, all results shown here are collected from the "euphonium" dataset; other datasets yield similar behavior. Fig. 7(a) shows that the incremental learning method yields a better dataset than the batch method. Fig. 7(b) illustrates that, by not having to train with all available images at every iteration, OPTIMOL is more computationally efficient than a batch method. Finally, we show an ROC comparison of OPTIMOL vs. the batch method. In our system, the classifiers change every iteration according to the updates of the models; it is therefore not meaningful to show ROC curves for the intermediate-step classifiers. Instead, an ROC is displayed to compare the classification performance of the model learned by batch learning and the final model of the incremental learning. Classifier quality is measured by the area under its ROC curve.
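The area-under-ROC measure used here can be computed directly from classifier scores and ground-truth labels with the trapezoidal rule. The sketch below is a generic implementation, not code from the paper:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: sweep the decision threshold by sorting
    scores descending, accumulate true/false positive rates, and sum
    trapezoid strips. `labels` are 1 for positives, 0 for negatives;
    assumes at least one positive and one negative example."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    auc = 0.0
    prev_fpr = prev_tpr = 0.0
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        tpr, fpr = tp / pos, fp / neg  # detection rate, false alarm rate
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # one trapezoid strip
        prev_fpr, prev_tpr = fpr, tpr
    return auc

# A perfectly ranked toy example: positives score above all negatives.
perfect = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # -> 1.0
```

An AUC of 0.94 vs. 0.90, as reported below for the "euphonium" category, means the incrementally learned model ranks positives above negatives more consistently than the batch-learned one across all operating thresholds.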


Figure 7. Batch vs. incremental learning (case study for the "euphonium" category). Left: the number of images retrieved by incremental learning and batch learning, with false alarms represented as a darker hat on the top of each bar. Middle: running-time comparison of batch learning and OPTIMOL's incremental learning method as a function of the number of training images. The incrementally learned model is initialized by running the batch mode on 10 training images, which takes the same time as the batch method; after initialization, incremental learning is more efficient than the batch method. Right: Receiver Operating Characteristic (ROC) curves of the incrementally learned model (green lines) versus the model learned in batch mode from the same seed images (red line). The area under the ROC curve of OPTIMOL is 0.94, while it is 0.90 for batch learning.

5. Conclusion and future work

We have proposed a new approach (OPTIMOL) for image dataset collection and model learning. Our experiments show that, as a fully automated system, OPTIMOL achieves dataset collection results nearly as accurate as those of humans. In addition, it provides a useful annotation of the objects in the images. Further experiments show that the models learnt by OPTIMOL are competitive with the current state of the art in object classification. Human labor is one of the most costly and valuable resources in research; we offer OPTIMOL as a promising alternative for collecting larger image datasets with high accuracy. In future studies, we will further improve the performance of OPTIMOL by refining the model learning step and introducing more descriptive object models.

References

[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11):1475–1490, 2004.

[2] T. Berg and D. Forsyth. Animals on the web. In Proc. Computer Vision and Pattern Recognition, 2006.

[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[4] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based Vision, 2004.

[5] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.

[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. Computer Vision and Pattern Recognition, 2005.

[7] H. Feng and T. Chua. A bootstrapping approach to annotating large image collections. In Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 55–62, 2003.

[8] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In Proc. International Conference on Computer Vision, 2005.

[9] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. Computer Vision and Pattern Recognition, pages 264–271, 2003.

[10] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proc. 8th European Conf. on Computer Vision, 2004.

[11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, 1999.

[12] T. Kadir and M. Brady. Scale, saliency and image description. International Journal of Computer Vision, 45(2):83–105, 2001.

[13] S. Krempp, D. Geman, and Y. Amit. Sequential learning with reusable parts for object detection. Technical report, Johns Hopkins University, 2002.

[14] D. Lowe. Object recognition from local scale-invariant features. In Proc. International Conference on Computer Vision, pages 1150–1157, 1999.

[15] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Press, Norwell, 1998.

[16] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: a database and web-based tool for image annotation. 2005.

[17] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. In Proc. International Conference on Computer Vision, 2005.

[18] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing visual scenes using transformed Dirichlet processes. Advances in Neural Information Processing Systems, 18:1297–1304, 2005.

[19] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006.

[20] G. Wang, Y. Zhang, and L. Fei-Fei. Using dependent regions for object categorization in a generative framework. In Proc. Computer Vision and Pattern Recognition, 2006.

[21] K. Yanai and K. Barnard. Probabilistic web image gathering. In ACM SIGMM Workshop on Multimedia Information Retrieval, pages 57–64, 2005.



Figure 5. Image collection and annotation results by OPTIMOL. Due to the space limit, we only show the annotation figures for eighteen categories; for the remaining five categories, only the image collection results are shown (refer to the bar plots in the bottom row). The notations of the bars are provided at the bottom. The first 18 blocks are presented in two columns, with each row of the left (right) half page representing a category. Take "Laptop" as an example: the left sub-panel gives 4 sample annotation results (the bounding box indicates the estimated location and size of the laptop). The right sub-panel compares the number of images in the "Laptop" category across datasets. The blue bar indicates the number of "Laptop" images in the LabelMe dataset, and the yellow bar the number of images in Caltech 101-Human. The OPTIMOL result is displayed using the red and green bars: the red bar represents the number of images retrieved for the "Laptop" category in Caltech 101-Web, with the dark part on top giving the number of false positives; the green bar shows the number of clean images retrieved for the "Laptop" category from the Web-23 dataset, again with the dark part on top giving the number of false positives. In the last row of the figure, five bar plots for five further object categories are shown; the colors of the bars have the same meaning as in the first 18 blocks. Since the pictures in the "face" category of Caltech 101-Human were taken by camera instead of downloaded from the web, the raw images of the "face" category are not available. All of our results have been put online at http://vision.cs.princeton.edu/projects/OPTIMOL.htm