
Automatic Image Annotation using Multiple Grid Segmentation

Gerardo Arellano, L. Enrique Sucar, Eduardo F. Morales

Computer Science Department
Instituto Nacional de Astrofísica, Óptica y Electrónica
Luis Enrique Erro 1, Tonantzintla, Puebla, México

{garellano, esucar, emorales}@inaoep.mx

Abstract. Automatic image annotation refers to the process of automatically labeling an image with a predefined set of keywords. Image annotation is an important step in content-based image retrieval (CBIR), which is relevant for many real-world applications. In this paper, a new algorithm based on multiple grid segmentation, entropy-based information and a Bayesian classifier is proposed for an efficient, yet very effective, image annotation process. The proposed approach follows a two-step process. In the first step, the algorithm generates grids of different sizes and overlaps, and each grid cell is classified with a Naive Bayes classifier. In the second step, information based on the predicted class probability, its entropy, and the entropy of the neighbors of each grid element at the same and different resolutions is used as input to a second, binary classifier that qualifies the initial classification to select the correct segments. This significantly reduces false positives and improves the overall performance. We performed several experiments with images from the MSRC-9 database collection, which has manual ground truth segmentation and annotation information. The results show that the proposed approach performs very well compared to the initial labeling, and it also improves on another scheme based on multiple segmentations.

1 Introduction

Most recent work on image labeling and object recognition is based on a sliding window approach [1], avoiding in this way the difficult problem of image segmentation. However, the rectangular regions used in these approaches do not provide, in general, good spatial support for the object of interest, resulting in object features that are missed (outside the rectangle) or incorrectly included (inside the rectangle). Recently, it has been shown [2] that good spatial support can significantly improve object recognition. In particular, the authors demonstrate that by combining different image segmentation techniques, better results can be obtained than with a sliding window approach or any of the segmentation techniques by themselves. They found that, for all the algorithms considered, multiple segmentations drastically outperform the best single segmentation, and also that the different segmentation algorithms are complementary, with each

algorithm providing better spatial support for different object categories. It thus seems that, by combining several segmentation algorithms, labeling and object classification can be improved. However, in [2] the authors do not solve the problem of how to automatically combine the several segmentation methods.

More recently, Pantofaru et al. [3] proposed an alternative approach to combine multiple image segmentations for object recognition. Their hypothesis is that the quality of the segmentation is highly variable depending on the image, the algorithm and the parameters used. Their approach relies on two principles: (i) groups of pixels which are contained in the same segmentation region in multiple segmentations should be consistently classified, and (ii) the set of regions generated by multiple image segmentations provides robust features for classifying these pixel groups. They combine three segmentation algorithms, Normalized Cuts [4], Mean-Shift [5] and Felzenszwalb and Huttenlocher's [6], each with different parameters. The selection of correct regions is based on intersections of regions, i.e., pixels which belong to the same region in every segmentation. Since different segmentations differ in quality, they assume that the reliability of a region's prediction corresponds to the number of objects it overlaps with respect to the class labels. They tested their method with the MSRC 21 [7] and Pascal VOC2007 [8] data sets, showing improved performance with respect to any single segmentation method.

A disadvantage of the previous work is that the segmentation algorithms used are computationally very demanding, and they are executed several times per image, resulting in a very time-consuming process, not suitable for real-time applications such as image retrieval. Additionally, simple segmentation techniques, such as grids, have obtained results for region-based image labeling similar to those based on more complex methods [9].

We propose an alternative method based on multiple grid segmentation for image labeling. The idea is to take advantage of multiple segmentations to improve object recognition, but at the same time to develop a very efficient image annotation technique applicable in real time. The method consists of two main stages. In the first stage, an image is segmented into multiple grids at different resolutions, and each rectangle is labeled based on color, position and texture features using a Bayesian classifier. Based on the results of this first stage, in the second stage another classifier qualifies each segment, to determine whether the initial classification is correct or not. This second classifier uses as attributes a set of statistical measures from the first classifier for the region of interest and its neighbors, such as the predicted class probability, its entropy, and the entropy of the neighbors of each grid element at the same and different resolutions. The segments deemed incorrect by the second classifier are discarded, and the final segmentation and labeling consists of the union of only the correct segments.

We performed several experiments with images from the MSRC-9 database collection, which has manual ground truth segmentation and annotation information. The results show that the proposed approach is simple and efficient, and at the same time performs very well, in particular reducing the false positives compared to the initial annotation. We also compared our approach for

selecting segments based on a second classifier against the method of [3], which uses region intersection, with favorable results.

The rest of the paper is organized as follows. Section 2 presents an overall view of the proposed approach. Section 3 describes the segmentation process and the feature extraction mechanism. Section 3.3 describes the first classifier used to label the different segments in the images, while Section 4 explains the second classifier used to remove false positives and improve the overall performance of the system. In Section 5, the experiments and main results are given. Section 6 summarizes the main conclusions and provides future research directions.

2 Image Annotation Algorithm

The method for image annotation based on multiple grid segmentation consists of two main phases, each with several steps (see Figures 1 and 2):

Phase 1 – Annotation:

1. Segment the image using multiple grids at different resolutions.
2. Extract global features for each segment: color, texture and position.
3. Classify each segment based on the above attributes, using a Bayesian classifier previously trained on a set of labeled images.

Phase 2 – Qualification:

1. For each segment, obtain a set of statistical measures based on the results of the first classifier: class probability, entropy of the region, and entropy of its neighboring regions at the same and different resolutions.
2. Based on the statistical measures, use another, binary Bayesian classifier to estimate whether the original classification of each segment is correct or incorrect.
3. Discard incorrect segments (based on a certain probability threshold).
4. Integrate the correct segments for the final image segmentation and labeling.

In the following sections each phase is described in more detail.
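To make the control flow concrete, the following Python sketch summarizes the two phases. It is not the authors' implementation: grid_cells, extract_features and contextual_attributes are hypothetical helpers standing in for the components described in Sections 3 and 4, and the classifiers are assumed to expose a scikit-learn-style predict_proba.

```python
def annotate_image(image, base_clf, qual_clf, cell_sizes=(64, 32, 16),
                   threshold=0.9):
    """Two-phase annotation sketch: label every grid cell, then keep
    only the segments the qualification classifier accepts."""
    # Phase 1: multiple grid segmentation and initial labeling.
    segments = []
    for size in cell_sizes:
        for cell in grid_cells(*image.shape[:2], size):  # one grid's rectangles
            feats = extract_features(image, cell)        # position, color, texture
            probs = base_clf.predict_proba([feats])[0]   # P(C_i | A_1..A_n)
            segments.append({"cell": cell, "size": size,
                             "probs": probs, "label": probs.argmax()})

    # Phase 2: qualification with contextual statistics.
    accepted = []
    for seg in segments:
        attrs = contextual_attributes(seg, segments)     # entropies, Sec. 4
        p_ok = qual_clf.predict_proba([attrs])[0][1]     # P(label is correct)
        if p_ok >= threshold:                            # discard dubious cells
            accepted.append(seg)
    return accepted
```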

3 Phase 1: Initial Annotation

Before applying our classifiers for image annotation, we perform two operations on the images: (i) segmentation and (ii) feature extraction.

3.1 Image Segmentation

Carbonetto [9] found that a simple and fast grid segmentation algorithm can perform better than a computationally expensive algorithm like Normalized Cuts [4]. The effectiveness of grid segmentation depends on the size of the grid and on the nature of the images. On the other hand, combining several segmentation algorithms tends to produce better performance, as suggested in [2]

Fig. 1. Phase 1: annotation. For each image in the database we compute multiple grid segmentations. For each segment we extract position, color and texture features, and use these features to classify each segment.

and shown in [3]. In this paper we propose to use several grid segmentations, label each segment and combine the results, integrating two relevant ideas: (i) grid segmentation has a very low computational cost with reasonable performance, so it can be effectively used in image retrieval tasks, and (ii) using grids of different sizes can produce relevant information for different regions and for images of different nature. So, rather than guessing the right grid size for each image, we use a novel approach to combine the information of the different labels assigned to each grid cell to improve the final segmentation and labeling results. In this paper we used three grid segmentations of different sizes, but the approach can be easily extended to include additional grids.
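For illustration, enumerating the cells of one grid reduces to listing rectangles. The sketch below is our own code, not the paper's; clipping border cells to the image is one possible design choice.

```python
def grid_cells(height, width, cell_size):
    """Enumerate the (x, y, w, h) rectangles of a regular grid.
    Border cells are clipped so the image is fully covered even when
    its dimensions are not multiples of the cell size."""
    cells = []
    for y in range(0, height, cell_size):
        for x in range(0, width, cell_size):
            cells.append((x, y,
                          min(cell_size, width - x),
                          min(cell_size, height - y)))
    return cells

# Three grids as in the experiments (Sec. 5): 64, 32 and 16 pixel cells.
grids = {s: grid_cells(198, 320, s) for s in (64, 32, 16)}
```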

3.2 Features

There is a large number of image features that have been proposed in the literature. In this paper we extract features based on position, color and texture for each image segment, but other features could be incorporated.

As position features, we extract for each image segment the average and standard deviation of the x and y coordinates of the pixels in the rectangular region.

Color features are the most common features used in image retrieval. In this paper we used the average and standard deviation of the three RGB channels for each image segment (6 features). In the CIE-LAB color space, numeric differences between colors agree more consistently with human visual perception. Thus, we also include the average, standard deviation and skewness of the three channels of the CIE-LAB space for each image segment.
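As a concrete example, the position and color features above can be computed as follows. This is a sketch under our own library choices (NumPy, scikit-image, SciPy); the paper does not specify an implementation, and the texture features are covered separately below.

```python
import numpy as np
from scipy.stats import skew
from skimage.color import rgb2lab

def position_color_features(image, cell):
    """Position and color features for one grid cell.
    `image` is an RGB array with values in [0, 1]; `cell` is (x, y, w, h)."""
    x, y, w, h = cell
    patch = image[y:y + h, x:x + w]

    # Position: mean and standard deviation of the pixel coordinates.
    ys, xs = np.mgrid[y:y + h, x:x + w]
    pos = [xs.mean(), xs.std(), ys.mean(), ys.std()]

    # Color: mean and std of each RGB channel (6 features) plus
    # mean, std and skewness of each CIE-LAB channel (9 features).
    rgb = patch.reshape(-1, 3)
    lab = rgb2lab(patch).reshape(-1, 3)
    color = (list(rgb.mean(0)) + list(rgb.std(0)) +
             list(lab.mean(0)) + list(lab.std(0)) + list(skew(lab)))
    return np.array(pos + color)
```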

Fig. 2. Phase 2: qualification. For all the classified segments, we obtain the probability of the label, its entropy, the neighborhood entropy and the intersection entropy, used as input to another classifier to reinforce the confidence of each label; we then select the correct segments and merge segments with the same class.

The perception of texture also plays an important role in content-based image retrieval. Texture is defined as the statistical distribution of spatial dependencies of the gray level properties [10]. One of the most powerful tools for texture analysis is the Gabor filter [11], a linear filter used in image processing for edge detection. The frequency and orientation representations of Gabor filters are similar to those of the human visual system, and they have been found to be particularly appropriate for texture representation and discrimination. A Gabor filter can be viewed as the product of a low-pass (Gaussian) filter with a sinusoid, at different orientations and scales. In this paper we applied Gabor filters with four orientations, θ = {0°, 45°, 90°, 135°}, and two different scales, obtaining 8 filters in total.
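Such a bank can be built, for example, with OpenCV's getGaborKernel. The sketch below follows the 4-orientation, 2-scale setup from the text; the kernel size and the sigma/wavelength pairs are our assumptions, since the paper does not report the exact parameters.

```python
import cv2
import numpy as np

# 8-filter Gabor bank: 4 orientations x 2 scales, as in the text.
orientations = [0, 45, 90, 135]
scales = [(4.0, 8.0), (8.0, 16.0)]            # assumed (sigma, wavelength) pairs

bank = [cv2.getGaborKernel(ksize=(31, 31), sigma=s, theta=np.deg2rad(t),
                           lambd=l, gamma=0.5)
        for t in orientations for (s, l) in scales]

def texture_features(gray_patch):
    """Mean and std of each Gabor filter response over a grayscale patch."""
    feats = []
    for kernel in bank:
        resp = cv2.filter2D(gray_patch.astype(np.float32), cv2.CV_32F, kernel)
        feats.extend([resp.mean(), resp.std()])
    return np.array(feats)
```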

3.3 Base Classifier

The previously described features are used as input to a Naive Bayes classifier. This classifier assumes that the attributes are conditionally independent given the class, so using Bayes' theorem the class probability given the attributes, P(C_i | A_1, ..., A_n), is given by:

P(C_i \mid A_1, \ldots, A_n) = \frac{P(C_i)\, P(A_1 \mid C_i) \cdots P(A_n \mid C_i)}{P(A_1, \ldots, A_n)}    (1)

where C_i is the i-th value of the class variable and A_1, ..., A_n are the feature variables. For each segment we obtain the probabilities of all the classes and select the class value with maximum a posteriori probability; in this way we label all the segments created by the multi-grid segmentation process.
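In practice, Equation 1 corresponds to a standard naive Bayes model. A minimal sketch with scikit-learn's GaussianNB follows; the Gaussian form of P(A_j | C_i) and the toy data are our assumptions, as the paper does not state which conditional densities were used.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for the real segment features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))       # one feature vector per segment
y = rng.integers(0, 9, size=500)     # 9 classes, as in MSRC-9

clf = GaussianNB().fit(X, y)

probs = clf.predict_proba(X[:1])[0]  # P(C_i | A_1, ..., A_n), Eq. (1)
label = int(np.argmax(probs))        # maximum a posteriori class
```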

4 Phase 2: Annotation Qualification

In the second phase we use a second classifier to qualify the classes given by the first classifier and improve the performance of the system. This is a binary classifier that decides whether the predicted label of the first classifier is correct or not, given additional contextual information. We compute the likelihood of the predicted class using another Naive Bayes classifier.

As attributes for this second classifier we use:

Class Probability: The probability of the label given by the first classifier.

Entropy: The entropy of each segment, evaluated over the class probabilities predicted by the first classifier and defined as:

H(s) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i)    (2)

where P(C_i) is the predicted probability of class i and n is the total number of classes. (A sketch of these entropy computations is given after this list.)

Neighborhood Entropy: We also extract the entropy of the segment's neighbors, adding this information only for the neighbors that have the same class:

H(v) = \frac{1}{|v_c|} \sum_{x \in V_c} H(x)\, \delta(\mathrm{Class}(x), \mathrm{Class}(v))    (3)

where V_c is the set of neighboring segments, H(x) is the entropy of neighbor x, and |v_c| is the number of neighbors with the same class as v, so the sum averages the entropies of the same-class neighbors, with

\delta(x, v) = \begin{cases} 1 & \text{if } x = v \\ 0 & \text{otherwise} \end{cases}

This is illustrated in Figure 3, where the class labels are shown with different colors. In this case, three neighbors have the same class as the central grid cell, while one neighbor has a different class. The value of the attribute is the normalized sum of the entropies of the three neighbors with the same class.

Intersection Entropy: We also consider information from the cells of the other grids that intersect the current segment. In this case we apply Equation 3 to the segments of the different grids. For the three grids considered in the experiments, the cells of the largest grid have 20 intersecting neighbors each, the middle-size cells have 5 each, and the smallest cells have only two. This is illustrated in Figure 4. Different grid segmentations and intersection schemes could be used as well.

Combined Attributes: We also incorporate new attributes defined as combinations of some of the previous ones, namely the entropy of the segment plus the neighborhood entropy, and the entropy of the segment plus the intersection entropy.

Fig. 3. Neighbors in the same grid. We consider the top, bottom, right and left neighbors and obtain their entropy, considering only the neighbors that have the same class as the current segment (best seen in color).

Fig. 4. Neighbors in the other grids. The image is segmented into grids of different sizes, so each segment has a different number of intersecting neighbors. For the three grids used in the experiments, the largest segments have 20 neighbors, the middle-size segments have 5 neighbors, and the smallest segments have only 2 neighbors (best seen in color).

The incorporation of additional features that represent combinations of other features can sometimes produce better results. Other features and other combinations could be used as well.
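As announced above, the following sketch shows one way to compute the entropy attributes of Equations 2 and 3; the dictionary representation of a segment is our assumption, not the authors' code.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a predicted class distribution (Eq. 2)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())

def neighborhood_entropy(seg, neighbors):
    """Average entropy of the neighbors sharing seg's label (Eq. 3).
    Works both for same-grid neighbors and for the intersecting cells
    of other grids (the intersection entropy)."""
    same = [n for n in neighbors if n["label"] == seg["label"]]
    if not same:
        return 0.0
    return sum(entropy(n["probs"]) for n in same) / len(same)
```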

The second classifier is shown in Figure 5. This classifier is used to qualify each segment and filter out the incorrect ones (those with a low probability of being correct are discarded).
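Training and applying the qualification classifier then looks like the following sketch. The toy attribute data is a placeholder; the 90% threshold is the one reported in Section 5.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# One row of Phase-2 attributes per segment: class probability, segment
# entropy, neighborhood entropy, intersection entropy and combinations.
rng = np.random.default_rng(1)
attrs = rng.random((400, 6))              # toy attribute vectors
correct = rng.integers(0, 2, size=400)    # 1: the Phase-1 label was right

qualifier = GaussianNB().fit(attrs, correct)

# Discard segments whose probability of being correct is too low.
p_correct = qualifier.predict_proba(attrs)[:, 1]
keep = p_correct >= 0.9
```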

Fig. 5. Second classifier. Graphical model for the qualification classifier, showing the features used to decide whether segments were correctly or incorrectly classified.

5 Experiments and Results

We performed several experiments using the MSRC-9 database, with 9 classes: grass, cow, sky, tree, face, airplane, car, bicycle and building. All images were resized to 320x198 or 198x320, depending on the original image size, and segmented using three different grids with cell sizes of 64, 32 and 16 pixels.

The database has 232 images; 80% were randomly selected for training and 20% for testing, for both classifiers. Position, color and texture features were extracted for the first classifier, and the features described in Section 4 for the second classifier.

We first analyzed some of the features used in the second classifier. We found that the entropy and intersection entropy measures work well as discriminants between segments that are correctly or incorrectly classified. This is shown in Figure 6, where the left graph shows the division between correctly and incorrectly classified segments using entropy, while the right graph shows the separation considering the intersection entropy.

Fig. 6. The figures show how a sample of 100 segments from different classes are classified as correct (red triangles) or incorrect (blue circles), based on the segment's local entropy (left) and the intersection entropy (right). Although the separation is not perfect, a certain tendency can be observed in both graphs, which confirms that these features can be used as indicators to distinguish correct from incorrect region classifications.

We then compared the performance of the base classifier against the performance after the segments are selected by the qualification classifier. The results, in terms of true positives (TP) and false positives (FP), are summarized in Figure 7. Although there is a slight reduction in true positives, there is a great reduction in false positives, showing that our method can effectively eliminate incorrect segments. The reduction in TP is not a problem, as there is redundancy in the segments due to the use of multiple grids.

Finally, we compared our method for segment qualification based on a second classifier against the method proposed in [3] based on intersection of regions.

Fig. 7. Comparison of the results with only the first classifier and after the second classifier is applied. The graphs show the TP and FP generated by both classifiers.

For this we used the same data and incorporated the intersection criterion into our multiple grid segmentations, instead of the second classifier. The results in terms of different performance criteria are summarized in Figures 8 and 9. In Figure 8 we compare the number of true positives (TP) and false positives (FP) for our approach vs. region intersection, and in Figure 9 we compare them in terms of precision, recall and accuracy. A segment is considered selected when its likelihood is above 90%, both for the TP and FP counts. We observe that our method for region qualification outperforms the method based on intersections in all the performance measures, with significant differences in TP and precision. Examples of region labeling and qualification for 4 images with two different grid sizes are shown in Figure 10.
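For reference, these measures follow the standard definitions: precision = TP/(TP + FP), recall = TP/(TP + FN), and accuracy = (TP + TN)/(TP + TN + FP + FN).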

Our current implementation in MATLAB requires about 30 seconds per image for the complete process (both phases), using a PC with a dual core processor at 2.4 GHz and 4 GB of RAM. We expect that an optimized C/C++ implementation could reduce this time by at least an order of magnitude.

These experiments show that our novel approach, using a second classifier with contextual information based on statistical measures, produces significant improvements over the initial labeling, and is more effective than the approach based on intersections.

6 Conclusions and Future Work

In this paper we proposed an automatic image annotation algorithm using a multiple grid segmentation approach. Images are segmented in a very efficient way using simple grid segmentation with different grid sizes. Individual features based on position, color and texture are extracted and used for an initial

Fig. 8. Comparison of segment selection based on our approach vs. region intersection in terms of true positives (TP) and false positives (FP).

Fig. 9. Comparison of segment selection based on our approach vs. region intersection in terms of precision, recall and accuracy.


Fig. 10. Sample images. Top: original image. Middle: coarse grid. Bottom: intermediate grid. Correct labels are in black and incorrect ones in red (best seen in color).

labeling of the different grid segments. This paper introduces a novel approach to combine information from different segments, using a second classifier and features based on entropy measures of the neighboring segments. It is shown that the second classifier significantly improves the initial labeling process, decreasing the false positive rate, and that it performs better than a selection based on region intersection.

As future work, we plan to combine the correct regions from the different grids to produce a final single segmentation and labeling, and to test our approach on other image databases.

Acknowledgments

We would like to thank the members of the INAOE TIA research group for their comments and suggestions. This work was partially supported by CONACyT under grant 271708.

References

1. Viola, P., Jones, M.: Robust real-time face detection. Int. J. of Computer Vision (2001)

2. Malisiewicz, T., Efros, A.A.: Improving spatial support for objects via multiple segmentations. In: BMVC (2007)

3. Pantofaru, C., Schmid, C., Hebert, M.: Object recognition by integrating multiple image segmentations. In: Computer Vision – ECCV 2008. Lecture Notes in Computer Science (2008) 481–494

4. Shi, J., Malik, J.: Normalized cuts and image segmentation. In: Proc. CVPR (1997) 731–743

5. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 603–619

6. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. Int. Journal of Computer Vision 59 (2004) 167–181

7. Shotton, J., Winn, J., Rother, C., Criminisi, A.: The MSRC 21-class object recognition database (2006)

8. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) (2007)

9. Carbonetto, P.: Unsupervised statistical models for general object recognition. Master's thesis, The University of British Columbia (2003)

10. Aksoy, S., Haralick, R.: Textural features for image database retrieval. In: CBAIVL 1998, IEEE Computer Society (1998) 45

11. Chen, L., Lu, G., Zhang, D.: Content-based image retrieval using Gabor texture features. In: PCM 2000, Sydney, Australia (2000) 1139–1142