CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
An important task of a Content-Based Image Retrieval (CBIR) system is to search the image database and retrieve the required images quickly and efficiently. The first and most important task in any CBIR application is the extraction of suitable visual features from the images; these features form the basis of all CBIR applications. Various methods have been studied for extracting different types of visual features from images. In fact, most early CBIR systems attempted to find the best possible visual features to represent different kinds of images for effective retrieval. In this chapter the different types of visual features are discussed and the related work on each of them is reviewed.
Digitized images are stored in databases as two-dimensional arrays of pixel intensities without inherent meaning. Extracting meaningful, useful and accurate information from this raw data is therefore the main issue; this process is called feature extraction, and the extracted descriptors are called image features. A feature is defined to capture a certain visual property of an image, either globally for the entire image or locally for a small group of pixels. The most commonly used features reflect color, texture, shape, and salient points in images. Existing image features can be grouped into two categories:
Global features
Local features
In global extraction, features are computed to capture the overall characteristics of an image and are represented in the form of histograms or statistical models. The advantage of global extraction is its high speed, both for extracting features and for computing similarity. Popular global features include Edge Histograms [LC01], MPEG-7 color descriptors [LC01], Color Correlograms [HRM97], Local Binary Patterns (LBP) [TMT02], Gabor filtering features [MJW02], Color Moments [SO95], Color Histograms [SB91], and Tamura features [TMY78].
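As a concrete illustration of one of these descriptors, the basic 8-neighbour LBP code of a single pixel can be sketched as follows (this is the original 3 x 3 formulation, not the later multi-scale variants; the neighbour ordering is an arbitrary choice here):

```python
def lbp_code(image, y, x):
    # Basic 8-neighbour Local Binary Pattern: threshold each neighbour of
    # (y, x) against the centre pixel and pack the results into one byte.
    center = image[y][x]
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(neighbours):
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code
```

A global LBP feature is then the histogram of these codes over all interior pixels of the image.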
Local features are small sub-images or patches extracted from the original image. In local feature extraction, a set of features is computed for every pixel using its neighborhood. To reduce computation, an image may be divided into small, non-overlapping blocks, and features are computed individually for every block. The features are still local because of the small block size, but the amount of computation is only a fraction of that required for computing features around every pixel. Local features can yield good results in various classification tasks [CWK05]. They also have properties that are attractive for image recognition, e.g. they are inherently robust to translation; these properties are equally useful for image retrieval.
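The block-wise extraction described above can be sketched generically; in this minimal illustration the block size and the per-tile mean feature are only placeholders for whatever descriptor a real system would compute:

```python
def block_features(image, block, feature_fn):
    # Split a 2-D grey image into non-overlapping block x block tiles and
    # compute one feature (value or vector) per tile via feature_fn.
    feats = []
    for y in range(0, len(image) - block + 1, block):
        for x in range(0, len(image[0]) - block + 1, block):
            tile = [row[x:x + block] for row in image[y:y + block]]
            feats.append(feature_fn(tile))
    return feats

def tile_mean(tile):
    # Example per-tile feature: the mean intensity.
    vals = [v for row in tile for v in row]
    return sum(vals) / len(vals)
```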
2.2 Content-Based Image Retrieval using Global features
2.2.1 Colour
The color feature is one of the most widely used visual features in image
retrieval. It is relatively robust to background complication and independent of image
size and orientation. Representative studies of color perception and color spaces are available in the literature [MMD76] [WYA97].
The color histogram is the most commonly used color feature representation in image retrieval systems. Statistically, the color histogram denotes the joint probability of the intensities of the three color channels. Swain and Ballard proposed histogram intersection, an L1 metric, as the similarity measure for color histograms [SB91]. To take into account the similarity between similar but not identical colors, Ioka [MI89] and Niblack et al. [NB94] introduced an L2-related metric for comparing histograms. Furthermore, considering that most color histograms are very sparse and thus sensitive to noise, Stricker and Orengo proposed using the cumulated color histogram.
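A minimal sketch of a quantized RGB histogram and the Swain-Ballard intersection measure, assuming 8-bit pixels; the bin count and bucket layout here are illustrative, not those of [SB91]:

```python
def color_histogram(pixels, bins=8):
    # Quantize each (r, g, b) value in 0..255 into one of bins**3 buckets
    # and return a normalized histogram.
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    n = float(len(pixels))
    return [h / n for h in hist]

def histogram_intersection(h1, h2):
    # Swain-Ballard histogram intersection: sum of bin-wise minima.
    # Equals 1.0 for identical normalized histograms, 0.0 for disjoint ones.
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Because the intersection of normalized histograms lies in [0, 1], it can be used directly as a similarity score for ranking database images.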
Stricker and Orengo [SO95] proposed the color moments approach to overcome the quantization effects inherent in the color histogram. The mathematical
foundation of this approach is that any color distribution can be characterized by its
moments. Furthermore, since most of the information is concentrated on the low-
order moments, only the first moment (mean), and the second and third central
moments (variance and skewness) were extracted as the color feature representation.
Weighted Euclidean distance was used to calculate the color similarity.
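A sketch of these three moments for one colour channel, together with the weighted Euclidean distance between two such feature vectors; the weights passed in are hypothetical, not values from [SO95]:

```python
import math

def color_moments(channel):
    # channel: flat list of intensity values for one colour plane.
    n = len(channel)
    mean = sum(channel) / n                                  # 1st moment
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var)                                     # 2nd central moment
    third = sum((x - mean) ** 3 for x in channel) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)   # 3rd central moment
    return [mean, std, skew]

def weighted_euclidean(f1, f2, weights):
    # Weighted Euclidean distance between two moment feature vectors.
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2)))
```

Concatenating the three moments for each of the three colour channels yields the compact 9-element feature vector that makes this approach so fast to compare.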
To facilitate fast search over large-scale image collections, Smith and Chang
proposed color sets as an approximation to the color histogram [SC95]. They first
transformed the (R, G, B) color space into a perceptually uniform space, such as
HSV, and then quantized the transformed color space into M bins. A color set is
defined as a selection of colors from the quantized color space. Because color set
feature vectors were binary, a binary search tree was constructed to allow a fast
search.
In 1996, histogram refinement based on color coherence vectors was proposed [GR96]. The technique considers spatial information and classifies the pixels of each histogram bucket as coherent if they belong to a large uniformly-colored region and incoherent otherwise. The color coherence vector technique improves the performance of histogram-based matching but is computationally expensive.
The color correlogram feature was proposed in [HKM97]; it takes into account the local spatial correlation of colors as well as the global distribution of this spatial correlation. The correlogram captures how the spatial correlation of pairs of colors changes with distance and hence outperforms classical histogram-based techniques.
The fuzzy color histogram [VB00] was proposed to revisit the use of color image content as an image descriptor through the introduction of fuzziness, which naturally arises from the imprecision of pixel color values and human perception. In 2000 the authors proposed the use of both fuzzy color histograms and their corresponding fuzzy distances for the retrieval of color images from various databases.
The color histogram is an efficient tool that is widely used in CBIR. The classic method of color histogram creation results in very large histograms with large variations between neighboring bins, so small changes in the image can produce great changes in the histogram. Moreover, since each color space consists of three components, it leads to 3-dimensional histograms, and manipulating and comparing 3D histograms is a complicated and computationally expensive procedure. To overcome these problems, Konstantinidis [KGA05] proposed projecting the 3D histogram onto a single one-dimensional histogram, a step called histogram linking. In their method, a new fuzzy linking method of color histogram creation based on the L*a*b* color space provides a histogram that contains only 10 bins. The method was assessed by the performance achieved in retrieving similar images from a widely diverse image collection. The experimental results show that the proposed method is less sensitive to various changes in the images (such as lighting variations, occlusions and noise) than other methods of histogram creation. A query image and its fuzzy linked histogram are shown in Figure 2.1.
Figure 2.1: Fuzzy linked histogram
Seong-Yong Hong et al. [HC06] utilized HSI values, converted from RGB values, to generate FMV-index values for image searching. The HSI color space is described through a hue range from 0° to 360°, as shown in Figure 2.2.
Figure 2.2: HSI color model
Three color characteristics, hue (H), saturation (S), and intensity (I) or lightness (L),
are defined to distinguish color components. Hue describes the actual wavelength of
a color by representing the color name, such as red or yellow. Saturation is a measure of the purity of a color, indicating how much white light has been added to a pure color.
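One common RGB-to-HSI conversion can be sketched as follows; this is a standard textbook formulation, not necessarily the exact conversion used in [HC06]:

```python
import math

def rgb_to_hsi(r, g, b):
    # r, g, b are normalized to [0, 1]; returns hue in degrees [0, 360),
    # saturation in [0, 1] and intensity in [0, 1].
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    h = 0.0 if den == 0 else math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    if b > g:                     # hue lies in the lower half of the circle
        h = 360.0 - h
    return h, s, i
```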
First, the RGB histograms are extracted from color images and are stored into
databases. These RGB values are converted into HSI values and FMV-index is
produced from it. FMV-index is stored in an emotional categories table. To search an
image the emotional adjectives are applied to color images and fuzzy values are
automatically obtained based on human visual features. The FMV-index searching scheme supports semantic-based retrieval that depends upon human sensation (e.g., “cool” images) and emotion (e.g., “soft” images), as well as traditional color-based retrieval. Emotional concepts are classified into twelve classes
according to emotional expression and images are classified into these categories as
well. The image searching speed was improved by assigning FMV index value to
each class. As a result, more user-friendly and accurate searching, with emotional
expression, was realized in their experiment.
Color histograms are the most commonly used features for image retrieval due
to their translation and rotation invariance. However, traditional color histogram features usually lack spatial structural information, which may cause erroneous retrieval results. In [XSB08] image retrieval was proposed
based on the Weighted DCT Spatial Combination Histogram (WDSCH). Equalized DCT coefficient histograms for each divided image block are first constructed using the DCT transformation and a histogram equalization method. The Weighted DCT
Spatial Combination Histogram features for the entire image are then extracted in the
HSV color space with the weights of the spatial information and energy distribution.
Experimental results on the three image datasets with different conditions show that
the compact histogram features containing spatial structural relations are robust to
image translation, scaling and rotation, and can bring about good retrieval precision
and speed simultaneously.
Bo Gun et al. [BKS08] extracted a perceptual representation of the original color image, a statistical signature obtained by modifying the general color signature, which consists of a set of points with statistical volume. The authors also used a dissimilarity measure for statistical signatures called the Perceptually Modified Hausdorff Distance (PMHD), which is based on the Hausdorff distance. The PMHD is insensitive to changes in the mean color features of a signature and is theoretically sound for incorporating human perception into the metric. To deal with partial matching, a partial PMHD was defined, which explicitly removes outliers using an outlier detection function. Their experimental results show that the retrieval precision of the PMHD is, on average, 20–30% higher than that of the next best method.
Figure 2.3: An example of perceptual dissimilarity based on the densities of two color features.
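For reference, the classical Hausdorff distance that the PMHD modifies can be sketched as:

```python
import math

def hausdorff_distance(a, b):
    # Classical (undirected) Hausdorff distance between two 2-D point sets:
    # the largest distance from any point in one set to its nearest
    # neighbour in the other set.
    def directed(p, q):
        return max(min(math.hypot(x1 - x2, y1 - y2) for (x2, y2) in q)
                   for (x1, y1) in p)
    return max(directed(a, b), directed(b, a))
```

The PMHD replaces the plain point-to-point distance with a perceptually weighted one over signature points, but the max-of-min structure above is the same.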
In [SMM09] the image was split into red, green and blue components. For each component the average was calculated, and every component image was split to obtain RH, RL, GH, GL, BH and BL images. RH is obtained by taking only the red components of all pixels in the image that are above the red average, and RL by taking only the red components of all pixels that are below the red average. Color moments were then applied to each divided component, i.e. RH, RL, GH, GL, BH and BL, and the images were classified using the K-means clustering algorithm.
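The per-channel high/low split of [SMM09] can be sketched for one colour plane; how values exactly at the average are assigned is an assumption here:

```python
def split_high_low(channel):
    # Split one colour plane into the values above its average (the "high"
    # image, e.g. RH) and the values at or below it (the "low" image, e.g. RL).
    avg = sum(channel) / len(channel)
    high = [v for v in channel if v > avg]
    low = [v for v in channel if v <= avg]
    return high, low
```

Applying this to each of the R, G and B planes yields the six component images on which the colour moments are then computed.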
In another approach, [XZY09] applied quotient space granularity theory to image retrieval. Image retrieval based on quotient space granularity theory includes two parts: first, attribute features are obtained by describing images under different granularities; then, by combining the attribute functions at different levels, a combined attribute feature is formed. In their approach, a global color feature based on the histogram and a local color feature based on DCT and SVD were considered as attributes under different granularities to describe image color, and the attribute functions under different granularities were combined to realize image retrieval. Their experiments report the precision of 100 random queries at a recall of 60%.
[SSM09] proposed a method in which an invariant color histogram technique was used for image retrieval. In their approach, gradients in different channels weight the influence of a pixel on the histogram in order to cancel out the changes induced by deformations. Their experimental results showed that the system was able to retrieve the closest match when a rotated image was given as input. To reduce the large variations between neighboring bins of conventional color histograms, fuzzy color histograms consisting of only ten bins were adopted. Moreover, their system was less sensitive to various changes in the images, such as lighting variations and noise.
Recently, Rachid et al. [RSM09] proposed an image retrieval method based on a spatial color indexing methodology using two descriptors. The first descriptor, CSE (Color Spatial Entropy), provides a mechanism to transform local histogram features into an entropy feature; the second descriptor, CHE (Color Hybrid Entropy), integrates the spatial relationships of colors in multi-resolution images. The multi-resolution images were obtained with different filters, such as mean, median, Laplacian and Gaussian filters of different sizes. Experimental results indicated that the proposed system was quite robust and provided high precision, but took more querying time than the local histogram system. The multi-resolution local histograms using median filtering are shown in Figure 2.4.
Figure 2.4 : Multiresolution Local Histograms
Abdelhamid Abdesselam et al. [AWN10] proposed a CBIR system in which a spiral representation of color was extracted from the image to retrieve rotated as well as scaled images. To extract the spiral representation of the color content, the image underwent a labeling process that assigns to every pixel the closest color from a predefined color table. The output, a Color Cluster Mapping (CCM) image, was equally divided into m x m sub-areas and a Single Color Mapping (SCM) image was derived for each predefined color. The Spiral Bit-string Representation of Color (SBRC) was helpful in capturing the color information of an image as well as its spatial distribution. Their experiments showed that the use of SBRC in retrieving rotated images was efficient and could easily be incorporated into real-life CBIR systems. The disadvantage of their system was that the algorithm used for sub-image retrieval had to be optimized before it could be used in a real-life CBIR system.
H. B. Kekre et al. [KST10] presented an image retrieval technique based on augmentation of colour averaging. The reflection of the original image was taken across the horizontal and vertical directions to obtain a flipped image, and the even part of the image was obtained by adding the original and flipped images. In their experiments, the combination of the original image with its even part performed better than retrieval using the original image alone, while the combination of the original image with the odd part showed the worst results. The colour averaging techniques row & column mean (RCM), forward diagonal mean (FDM) and row, column & forward diagonal mean (RCFDM) were therefore applied to the original image and the even part of the image. The techniques were tested on a generic image database of 1000 images spread across 11 categories; for each proposed CBIR technique, 55 queries (5 per category) were fired at the image database. Their experimental results showed improved performance compared to feature vectors from the original image alone.
Figure 2.5 a): Example of Row & Column Mean (RCM) vector
Figure 2.5 b): Example of Forward Diagonal Mean (FDM) vector
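The RCM vector illustrated above can be sketched for a single colour plane as:

```python
def row_column_mean(image):
    # RCM feature vector: the mean of every row followed by the mean of
    # every column of a single colour plane.
    row_means = [sum(row) / len(row) for row in image]
    col_means = [sum(image[y][x] for y in range(len(image))) / len(image)
                 for x in range(len(image[0]))]
    return row_means + col_means
```

For an M x N plane this gives a compact M + N element vector, which is why the colour averaging family of features is so cheap to compare.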
In another approach, H. B. Kekre et al. [KD10] used the Walsh transform to generate the feature vector for content-based image retrieval. The complex Walsh transform was conceived by multiplying the Sal functions by j and combining them with the Cal functions of the same sequence. The angle, calculated as the inverse tangent of sal/cal in the range 0 to 360 degrees, was divided into 4 sectors. The means of the real and imaginary values of these sectors in all three color planes were considered to design a feature vector of 24 components altogether. The Walsh transform of the color image was calculated in all three R, G and B planes. The experiment was carried out over a database of 270 images spread over 11 different classes, with the Euclidean distance used as the similarity measure. Average precision and recall were calculated for the performance evaluation; the overall average of the cross-over points of precision and recall is above 50%.
2.2.2 Texture
Texture refers to the visual patterns that have properties of homogeneity that
do not result from the presence of only a single color or intensity. It is an innate
property of virtually all surfaces, including clouds, trees, bricks, hair, and fabric. It
contains important information about the structural arrangement of surfaces and their
relationship to the surrounding environment. Because of its importance and usefulness in pattern recognition and computer vision, rich research results have accumulated over the past three decades, and many CBIR systems using texture features have been proposed.
In the early 1970s, Haralick et al. proposed the co-occurrence matrix
representation of texture features [HSD73]. This approach explored the gray level
spatial dependence of texture. It first constructed a co-occurrence matrix based on the
orientation and distance between image pixels and then extracted meaningful statistics
from the matrix as the texture representation.
Many other researchers followed the same line and proposed enhanced versions. For example, Gotlieb and Kreyszig studied the statistics originally proposed in [HSD73] and found experimentally that contrast, inverse difference moment, and entropy had the greatest discriminatory power.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
41
Motivated by psychological studies of human visual perception of texture, Tamura et al. explored texture representation from a different angle [TMY78]. They developed computational approximations to the visual texture properties found to be important in psychological studies. The six visual texture properties were coarseness, contrast, directionality, linelikeness, regularity, and roughness. One
major distinction between the Tamura texture representation and the co-occurrence
matrix representation is that all the texture properties in Tamura representation are
visually meaningful, whereas some of the texture properties used in co-occurrence
matrix representation may not be (for example, entropy). This characteristic makes
the Tamura texture representation very attractive in image retrieval.
In the early 1990s, after the wavelet transform was introduced and its
theoretical framework was established, many researchers began to study the use of the
wavelet transform in texture representation [LF93] [GKL94] [TNP94].
In [SC94] Smith and Chang used the statistics (mean and variance) extracted
from the wavelet subbands as the texture representation. This approach achieved over
90% accuracy on the 112 Brodatz texture images. To explore the middle-band
characteristics, a tree-structured wavelet transform was used by Chang and Kuo in
[CK93] to further improve the classification accuracy. The wavelet transform was
also combined with other techniques to achieve better performance. Gross et al. used
the wavelet transform, together with KL expansion and Kohonen maps, to perform
texture analysis in [GKL94]. Thyagarajan et al. [TNP94] and Kundu et al. [KC92]
combined the wavelet transform with a co-occurrence matrix to take advantage of
both statistics-based and transform-based texture analyses.
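One level of a Haar wavelet decomposition and the per-subband statistics used by Smith and Chang can be sketched as follows; even image dimensions are assumed, and the 1/4 normalization is one common convention:

```python
def haar2d(image):
    # One-level 2-D Haar transform: average (LL) plus horizontal (LH),
    # vertical (HL) and diagonal (HH) detail subbands.
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, len(image), 2):
        rll, rlh, rhl, rhh = [], [], [], []
        for x in range(0, len(image[0]), 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            rll.append((a + b + c + d) / 4.0)
            rlh.append((a - b + c - d) / 4.0)
            rhl.append((a + b - c - d) / 4.0)
            rhh.append((a - b - c + d) / 4.0)
        ll.append(rll); lh.append(rlh); hl.append(rhl); hh.append(rhh)
    return ll, lh, hl, hh

def subband_stats(band):
    # Mean and standard deviation of one subband, as in [SC94].
    vals = [v for row in band for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var ** 0.5
```

Recursing on the LL subband gives further levels, and collecting (mean, std) over all subbands yields the texture feature vector.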
Ma and Manjunath [MM95] evaluated the texture image annotation by various
wavelet transform representations, including orthogonal and bi-orthogonal wavelet
transforms, the tree-structured wavelet transform, and the Gabor wavelet transform.
They found that the Gabor transform performed best among the tested candidates, which matches human vision study results [SC96].
Tomasz [TA05] proposed Gabor filters for extracting the texture features needed to characterize images in a database. First, image normalization was performed to extract salient points and texture features that do not change with contrast and illumination. Multichannel Gabor filtering and the idea of hierarchical image representation were then applied, and the texture features were computed only in a neighborhood of the salient points. Around the extracted salient points, Regions of Interest (ROIs) were created in which color histograms and Zernike moments were calculated. These features were used as the image description, and images were retrieved based on feature similarity, where the features of the query specification are compared with features from the image database to determine which images best match the given features.
In [VNS06] a shift-invariant complex directional filter bank (CDFB) was proposed for texture image retrieval. By combining the Laplacian pyramid and the CDFB, a new image representation with an overcomplete ratio of less than 8/3 was obtained. The coefficients of the directional subbands are used to form a feature vector for classification. The texture retrieval performance of the proposed representation was compared with those of conventional transforms including the Gabor wavelet, the contourlet and the steerable pyramid. The overcomplete ratio of the proposed complex directional pyramid is about twice that of the contourlet, and much lower than those of the other two transforms. The experiments show that the technique is able to retrieve images with an overall precision of 71.53%.
Figure 2.6: Frequency Partition of Directional Filter Banks: a) CDFB b) Gabor wavelets c) Contourlet Transform d) Steerable Pyramid
Conventional Gabor representations and the features extracted from them often perform fairly poorly in retrieving rotated and scaled versions of the texture image under query. To address this issue, Ju Han [HM07] proposed a method using rotation-invariant and scale-invariant Gabor representations, where each representation requires only a few summations over the conventional Gabor filter impulse responses. The experiment was conducted on the Brodatz data set. Features are extracted from these new representations for rotation-invariant or scale-invariant texture image retrieval. Since the dimension of the new feature space is much reduced, it requires much less metadata storage and allows faster on-line computation of the similarity measure. The system proved robust for rotation-invariant and scale-invariant texture image retrieval.
In another approach, An P. N. Vo et al. [VON08] proposed a new statistical model for modeling natural images in the transform domain. They demonstrated that the von Mises (VM) distribution accurately fits the behavior of relative phases in the complex directional wavelet subbands of different natural images. An image feature was extracted based on the VM model and used for texture image retrieval. Each image in the database was decomposed using three decompositions: the curvelet, the 2-D Gabor transform and the CDFB. The Gabor wavelet and the curvelet were applied with four scales and six orientations per scale, while the CDFB had three scales of eight orientations. For each subband, the mean and standard deviation of the absolute values of the coefficients were calculated, and the relative phase (RP) matrix of each complex subband was created to construct the RP-based feature vectors. For each query image, the N nearest neighbors are selected, and the number of retrieved textures belonging to the same class as the query texture, excluding the query itself, out of the fifteen other members of its class gives the retrieval rate; the performance for an entire class is obtained by averaging this rate over the sixteen members belonging to the same texture class. The texture database used in the experiment contains 40 images from the VisTex database; each 512 × 512 image is divided into sixteen 128 × 128 non-overlapping sub-images, creating a database of 640 texture samples. Their experimental results showed that the VM-based feature yielded higher retrieval accuracy than the energy features and the relative phase features.
N. Gnaneswara Rao et al. [GVV09] presented a CBIR system based on wavelet transformations, using the Haar wavelet transformation to capture local-level texture features. First, the texture feature vectors for all images in the database were computed using the Haar wavelet transformation, and the images were clustered on these feature vectors using a modified ROCK clustering algorithm. The feature vector for the query image was then computed and compared against the indexed database to identify the closest cluster for the query image and retrieve its images. The system operated only at the primitive feature level (i.e. texture) and failed to retrieve images using other features such as colour and shape.
Figure 2.7: Wavelet Decomposition
Jianhua Wu et al. [JZY10] combined color and texture features for content-based image retrieval. The texture information of the image is represented by the Dual-Tree Complex Wavelet Transform (DT-CWT) and a rotated wavelet filter (RWF), while color histograms in the RGB and HSV color spaces are chosen as the color feature. The experimental results show that their method is efficient for simple images but fails to retrieve images that are complex in nature.
In the approach of Balamurugan et al. [BAN10], the colour feature of each query image was extracted and a quadratic distance measure was computed to retrieve similar images from the database. The 2D-DWT was applied to the query and database images, and the texture features of the query and database images were compared using the Euclidean distance measure. The images with similar texture features were selected and sorted using an integrated approach: the relevant images from the color-based and texture-based retrieval were compared to obtain the most relevant images from the database. Their system was able to retrieve relevant images without any transformation, with a precision-recall rate of 98%.
More recently, Kavitha et al. [KBP11] proposed a method in which both the query image and the images in the database were divided into 6 equal blocks. For each block an HSV colour histogram was constructed, statistical features (energy, contrast, entropy and inverse difference) were obtained from the GLCM, and a combined feature vector for color and texture was constructed. To obtain similar images from the database, the normalized Euclidean distance between the feature vector of the query image and the feature vectors of the target images was used. Their system retrieved the first 20 similar images with minimum Euclidean distance.
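The normalized Euclidean matching step can be sketched as follows; normalizing each dimension by its standard deviation over the database is one common convention, assumed here:

```python
def normalized_euclidean(f1, f2, stddevs):
    # Euclidean distance with each feature dimension scaled by its
    # standard deviation over the database, so no dimension dominates.
    return sum(((a - b) / s) ** 2
               for a, b, s in zip(f1, f2, stddevs)) ** 0.5

def top_k(query, database, stddevs, k=20):
    # Return the k database feature vectors closest to the query.
    ranked = sorted(database,
                    key=lambda f: normalized_euclidean(query, f, stddevs))
    return ranked[:k]
```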
Zhengli Zhu et al. [ZCY11] presented an approach for rotation-invariant texture image retrieval based on combined feature sets. The approach relied on nonsubsampled contourlet coefficients together with the angular second moment, moment of inertia, inverse difference moment and correlation of the grey-level co-occurrence matrices of the images. The nonsubsampled contourlet transform provides anisotropy and translation invariance, while the grey-level co-occurrence matrix reflects information about direction, adjacency spacing, and range of variation; the combined feature sets therefore capture more image information. Their experimental results show that the system was able to retrieve rotated images with a retrieval accuracy of 76.31%.
Figure 2.8: Two-level Nonsubsampled Contourlet Transform Decomposition
2.2.3 Shape
Shape is one of the most commonly used visual features in computer vision
and multimedia applications. Shape is relatively robust to background complication
and independent of image size and orientation. In general, the shape representations
can be divided into two categories, boundary-based and region-based. The former
uses only the outer boundary of the shape while the latter uses the entire shape region
[RSH96]. Many shape-based image retrieval systems have been proposed in the literature.
The Fourier descriptor uses the Fourier-transformed boundary as the shape feature. To take into account digitization noise in the image domain, Rui et al. proposed a modified Fourier descriptor that is both robust to noise and invariant to geometric transformations [RSH96]. Moment invariants, computed over the shape region and invariant to transformations, have also been used as shape features.
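A minimal Fourier-descriptor sketch on a closed boundary; the normalization choices here (dropping the DC term for translation invariance, dividing by the first harmonic for scale invariance, keeping magnitudes for rotation invariance) are standard but illustrative, not the modified descriptor of [RSH96]:

```python
import cmath

def fourier_descriptor(boundary, k=3):
    # boundary: list of (x, y) contour points treated as complex numbers.
    z = [complex(x, y) for x, y in boundary]
    n = len(z)
    coeffs = [sum(z[t] * cmath.exp(-2j * cmath.pi * u * t / n)
                  for t in range(n)) / n for u in range(k + 1)]
    mags = [abs(c) for c in coeffs[1:]]   # drop DC term -> translation invariant
    scale = mags[0] or 1.0
    return [m / scale for m in mags]      # scale by 1st harmonic; magnitudes
                                          # discard phase -> rotation invariant
```

A change of starting point or an in-plane rotation of the contour only rotates the phases of the coefficients, so the magnitude descriptor is unchanged.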
In [MKL97] , Babu et al. compared the performance of boundary-based
representations (chain code, Fourier descriptor, UNL Fourier descriptor), region-
based representations (moment invariants, Zernike moments, pseudo-Zernike
moments), and combined representations (moment invariants and Fourier descriptor,
moment invariants and UNL Fourier descriptor).
In addition to 2D shape representations, there were many methods developed
for 3D shape representations. In [WM81], Wallace and Mitchell proposed using a
hybrid structural/statistical local shape analysis algorithm for 3D shape
representation.
Zhiyong Wang et al. [ZWCF00] presented a two-step approach of using a
shape characterization function called centroid-contour distance curve and the object
eccentricity (or elongation) for leaf image retrieval. Both the centroid-contour
distance curve and the eccentricity of a leaf image are scale, rotation, and translation
invariant after proper normalizations. In the first step, the eccentricity was used to
rank leaf images, and the top scored images were further ranked using the centroid-
contour distance curve together with the eccentricity in the second step. A thinning-based method was used to locate the start point(s) in order to reduce the matching time. The method retrieves images based on global shape information.
Image characterization with fewer features and lower computational cost is always desirable, and the edge is a strong feature for characterizing an image. Minakshi Banerjee presented a technique for extracting the edge map of an image, followed by the computation of a global feature (fuzzy compactness) using the gray level as well as the shape information of the edge map [MK03]. Unlike other existing techniques, it does not require pre-segmentation for the computation of features.
C. Sheng et al. [SX05] used a different measure of shape similarity for image retrieval, based on the shape matrix, which is invariant and unique under rigid motions. Combined with the snake model, the original template shape is deformed to adjust itself to the shape images; the deformation spent by the original template in matching the shape images and the degree of matching are used to evaluate the similarity between them.
S. Arivazhagan used the Canny operator to detect the edge points of the image
[AGS07]. The contour of the image was traced by scanning the edge image, and
re-sampling was done to avoid discontinuities in the contour representation. The
resulting image was swept line by line, and the neighborhood of every pixel was
explored to detect the number of surrounding points and to derive the shape features.
The approach of Shan Li et al. [LLP09] is based on a set of orthogonal
complex moments of images known as Zernike moments (ZMs). Because rotating an
image affects its ZM phase coefficients, existing proposals normally use
magnitude-only ZMs as the image feature. Shan Li et al. combined both the
magnitude and the phase coefficients to form a new shape descriptor, referred to
as the invariant ZM descriptor (IZMD). Scale and translation invariance of the
IZMD were obtained by pre-normalizing the image using geometric moments. To make
the phase invariant to rotation, a phase correction was performed while
extracting the IZMD features. Experimental results showed that the proposed
shape feature is, in general, robust to changes caused by rotation, translation,
and/or scaling of the image shape, and performed better than the commonly used
magnitude-only ZMD.
Figure 2.9: (a) Original image. (b) Reconstructed image using ZMs. (c) Reconstructed image using complex ZMs.
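The rotation invariance of ZM magnitudes can be demonstrated with a short numerical sketch. This is a textbook-style implementation of a single Zernike moment, not the IZMD of [LLP09]; the per-pixel averaging used as normalization is my own choice and does not affect the invariance property.

```python
import numpy as np
from math import factorial

def zernike_moment(img, n, m):
    """Complex Zernike moment Z_nm of a square grayscale image.

    The image is mapped onto the unit disk; pixels outside the disk
    are ignored. Rotating the image by an angle a multiplies Z_nm by
    exp(-j*m*a), so the magnitude |Z_nm| is rotation invariant.
    """
    N = img.shape[0]
    y, x = np.mgrid[0:N, 0:N]
    # Map pixel indices to the square [-1, 1] x [-1, 1].
    x = (2 * x - N + 1) / (N - 1)
    y = (2 * y - N + 1) / (N - 1)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    mask = rho <= 1.0
    # Radial polynomial R_nm(rho).
    R = np.zeros_like(rho)
    for k in range((n - abs(m)) // 2 + 1):
        c = ((-1) ** k * factorial(n - k)
             / (factorial(k)
                * factorial((n + abs(m)) // 2 - k)
                * factorial((n - abs(m)) // 2 - k)))
        R += c * rho ** (n - 2 * k)
    V = R * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(img[mask] * V[mask]) / mask.sum()
```

Rotating the image by 90 degrees (an exact rotation on this sampling grid) leaves the magnitude unchanged, while the phase shifts by m times the rotation angle.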
2.3 Content-Based Image Retrieval using Discrete Cosine Transform
Lay et al. [LG99] proposed energy histograms of the low-frequency DCT
coefficients as image features. Since the low-frequency DCT coefficients carry
most of the energy in a DCT block, six blocks containing different combinations
of low-frequency DCT coefficients are selected, and the six DCT coefficient
histograms extracted from these blocks serve as the image features.
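The building blocks of such DCT-based features can be sketched as below. This is a generic illustration, assuming an orthonormal 8x8 DCT-II and a simple energy histogram over blocks; the exact coefficient groupings of [LG99] are not reproduced.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block (no SciPy needed)."""
    N = block.shape[0]
    k = np.arange(N)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
    C *= np.sqrt(2.0 / N)
    C[0, :] = np.sqrt(1.0 / N)   # DC basis row
    return C @ block @ C.T

def low_freq_energy_histogram(img, n_bins=8):
    """Histogram of the low-frequency AC energy of the 8x8 DCT blocks
    of a grayscale image (a simplified reading of the energy-histogram
    idea)."""
    h, w = (s - s % 8 for s in img.shape)
    energies = []
    for i in range(0, h, 8):
        for j in range(0, w, 8):
            d = dct2(img[i:i + 8, j:j + 8])
            lf = d[:3, :3].ravel()[1:]   # top-left region minus DC
            energies.append(np.sum(lf ** 2))
    hist, _ = np.histogram(energies, bins=n_bins)
    return hist / hist.sum()
```

Because the transform is orthonormal, it preserves energy (Parseval), which is what makes "energy carried by the low-frequency coefficients" a meaningful quantity.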
H. Yu [HY99] introduced a Q-metric on the DCT coefficients to measure
image similarity. A sub-image comprising all the DCT coefficients at a given
position of the 8×8 blocks is generated for each position, and the wavelet
transform is applied. When calculating the distance between a query and a target
image, the Q-metric counts the total number of coefficients that are higher than
a predefined threshold in the corresponding sub-images of both the query and the
target images. The overall distance is a weighted summation of the counts from
all sub-images. The Q-metric thus measures how many significant coefficients the
two images have in common.
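The counting step can be sketched as follows. This is my reading of the description above (counting positions that are significant in both images and weighting per sub-image); it omits the wavelet transform stage and is not the exact metric of [HY99].

```python
import numpy as np

def q_metric(query_subs, target_subs, weights, thresh=1.0):
    """Weighted count of coefficient positions that are significant
    (above `thresh` in magnitude) in BOTH images, summed over the
    corresponding sub-images. Higher scores mean the two images share
    more significant coefficients."""
    score = 0.0
    for q, t, w in zip(query_subs, target_subs, weights):
        common = (np.abs(q) > thresh) & (np.abs(t) > thresh)
        score += w * np.count_nonzero(common)
    return score
```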
Ngo et al. [NPC01] developed an image indexing technique via
reorganization of the DCT coefficients in the Mandala domain, representing the
color, shape, and texture features in the compressed domain. Their work
demonstrated advantages in indexing speed, but at the cost of significantly
reduced retrieval accuracy. As the DCT compacts the image energy into the
lower-order coefficients, they considered only the first nine AC coefficients of
each 8x8 DCT block, and the variance of these nine AC coefficients was used to
index the image. Although a minimal number of features is always a desirable
property for characterizing images, a single feature failed to achieve the
desired accuracy.
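The per-block index value can be sketched as below, assuming the first nine AC coefficients are taken in JPEG zigzag order; the function name is my own, not from [NPC01].

```python
import numpy as np

def zigzag_indices(n=8):
    """JPEG-style zigzag scan order of an n x n block: traverse the
    anti-diagonals, alternating direction."""
    s = lambda p: p[0] + p[1]
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (s(p), p[0] if s(p) % 2 else p[1]))

def ngo_block_feature(dct_block):
    """Single index value per 8x8 DCT block: the variance of the first
    nine AC coefficients in zigzag order."""
    zz = zigzag_indices(8)
    ac9 = np.array([dct_block[i, j] for i, j in zz[1:10]])
    return ac9.var()
```

Collapsing each block to one variance value is what makes the index compact, and also why the paragraph above notes that a single feature could not reach the desired accuracy.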
In [NS04] the features were extracted directly in the DCT domain. For
each 8x8 block of a color image in the DCT domain, a feature vector was
extracted. The k-means algorithm was then applied to cluster the feature vectors
of all the blocks of an image into groups, each cluster representing a distinct
object of the image. After clustering, the clusters with the largest membership
are selected, and the centroids of the selected clusters are taken as feature
vectors and indexed into the database. This increases the size of the feature
database and takes considerable time to index an image. Since judging image
retrieval results is a subjective matter, an evaluation of retrieval performance
was not reported.
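The cluster-then-keep-largest step can be sketched with a minimal k-means. This is a generic illustration of the [NS04] idea under my own simplifications (plain Lloyd iterations, random initialization); it is not the authors' code.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm); returns centroids, labels."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centroids[c] = pts.mean(axis=0)
    return centroids, labels

def block_signature(block_features, k=3, n_keep=2):
    """Cluster per-block feature vectors and keep the centroids of the
    n_keep largest clusters as the image signature."""
    centroids, labels = kmeans(block_features, k)
    sizes = np.bincount(labels, minlength=k)
    top = np.argsort(sizes)[::-1][:n_keep]
    return centroids[top]
```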
Lu et al. [LB05] proposed a new method to extract a vector quantization
index histogram from the DCT coefficients. First, the 64 DCT coefficients in
each 8 × 8 block are divided into four groups. For each color channel (Y, Cb,
Cr), four codebooks are trained from a randomly selected training image set.
After obtaining these 12 codebooks, the images in the data set are processed in
the same manner as the training images: the DCT sequences are divided into 12
groups and encoded with the corresponding codebooks, and the indexes of the DCT
coefficients from each codebook are jointly assembled into the quantization
index histograms.
Lu et al. [LLB06] approximated the pixel-domain color and texture
features directly from the DCT coefficients. The color features of the image are
calculated directly from the DCT coefficients by partially decoding the JPEG
image. Each 8 × 8 DCT block is divided into four sub-blocks, as shown in Figure
2.10, whose average color values are denoted M11, M12, M21, M22; these values
are approximated from the four upper-left coefficients of the 8 × 8 DCT block.
The texture information is a vector extracted from selected DCT coefficients in
six groups: group 1 is the DC coefficient, groups 2 and 3 carry frequency
information, and groups 4, 5, and 6 carry the vertical, horizontal, and diagonal
direction information. The mean and the standard deviation of all the
coefficients in each group are extracted as the texture features. The Euclidean
distance was used to evaluate the distance between the query image and the
images in the database.
Figure 2.10: Four sub-blocks of the 8x8 DCT block
In another approach, Tsai et al. [TCH08] proposed distance threshold
pruning (DTP) to alleviate the computational burden of CBIR without sacrificing
accuracy. In their approach, the images are converted into the YUV color space
and then transformed into discrete cosine transform (DCT) coefficients.
Benefiting from the energy-compaction property of the DCT, only the
low-frequency DCT coefficients of the Y, U, and V components are stored. When
querying an image, at the first stage the DTP serves as a filter to remove those
candidates with widely distinct features; at the second stage, a detailed
similarity comparison (DSC) is performed on the remaining candidates that pass
through the first stage. The experimental results showed that both high efficacy
and a high data-reduction rate can be achieved simultaneously with this
approach.
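The two-stage structure can be sketched as follows. This is a generic illustration of threshold pruning followed by detailed comparison; the coarse L1 pruning distance and the function names are my own choices, not the exact criteria of [TCH08].

```python
import numpy as np

def dtp_search(query_feat, db_feats, prune_thresh, top_k=5,
               detailed_distance=None):
    """Two-stage retrieval: a cheap distance-threshold pruning (DTP)
    pass removes clearly dissimilar candidates, then a detailed
    similarity comparison (DSC) ranks only the survivors."""
    if detailed_distance is None:
        detailed_distance = lambda a, b: float(np.linalg.norm(a - b))
    # Stage 1: prune candidates whose coarse (L1) distance exceeds
    # the threshold.
    survivors = [i for i, f in enumerate(db_feats)
                 if np.abs(query_feat - f).sum() <= prune_thresh]
    # Stage 2: detailed comparison on the survivors only.
    ranked = sorted(survivors,
                    key=lambda i: detailed_distance(query_feat, db_feats[i]))
    return ranked[:top_k]
```

The saving comes from the fact that the expensive `detailed_distance` is evaluated only on the candidates that survive the cheap first pass.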
Kekre et al. [KST09] proposed another technique for image retrieval using
color and texture features extracted from images with vector quantization based
on Kekre's fast codebook generation, which gave better discrimination capability
for Content-Based Image Retrieval (CBIR). The database image is divided into 2x2
pixel windows to obtain 12 color descriptors per window (red, green, and blue
per pixel), which form a vector; the collection of all such vectors is a
training set. Kekre's Fast Codebook Generation (KFCG) is then applied to this
set to obtain 16 code vectors. The Discrete Cosine Transform (DCT) is applied to
these code vectors after converting them to a column vector, and the transformed
vector is used as the image signature (feature vector) for retrieval. The method
requires fewer computations than a conventional DCT applied to the complete
image, gives the color-texture features of the image database at a reduced
feature set size, and avoids the resizing of images that is required by other
transform-based feature extraction methods.
The image database used in the experiments in [KST09] is a subset of the
Columbia Object Image Library (COIL-100) [NNM96], which contains 100 different
objects (classes), each rotated through 72 different angles, resulting in 7200
images. To test the system, 15 categories of images (1080 images in total) were
randomly selected from the COIL-100 database [NNM96]; the net average
precision/recall over the retrieved images for all categories was 82%.
Recently, H.B. Kekre et al. [KTA10] proposed image retrieval based on
feature vectors formed from fractional coefficients of transformed images using
the DCT and Walsh transforms. The feature vector size per image is greatly
reduced by taking fractional coefficients of the transformed image. In addition
to the full set of coefficients of the transformed image, fourteen reduced
coefficient sets (50%, 25%, 12.5%, 6.25%, 3.125%, 1.5625%, 0.7813%, 0.39%,
0.195%, 0.097%, 0.048%, 0.024%, 0.012% and 0.06% of the complete transformed
image) are considered as feature vectors. The two transforms are applied to the
gray-level equivalents and to the color components of the images to extract gray
and RGB feature sets, respectively. Using these fourteen reduced coefficient
sets for the gray as well as the RGB feature vectors, instead of all the
coefficients of the transformed images, resulted in better performance and lower
computation. The results showed a performance improvement (higher precision and
recall values) with fractional coefficients compared to the complete transform
of the image, at reduced computation and hence faster retrieval. Finally, the
Walsh transform surpasses the DCT in performance, with the highest precision and
recall values for fractional coefficients and the minimum number of computations
down to 0.097%, after which the DCT takes over.
Figure 2.11: The fractional coefficients of transformed images
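The idea of keeping only a fraction of the transformed coefficients can be sketched as below. This is one simple reading of "fractional coefficients" (keeping the upper-left block whose area is the requested fraction, where the transform concentrates the energy); [KTA10] may partition the coefficients differently.

```python
import numpy as np

def fractional_coefficients(T, p):
    """Feature vector from the upper-left block of a transformed
    (e.g. DCT or Walsh) image T, containing roughly a fraction p of
    all coefficients."""
    h = max(1, int(round(T.shape[0] * np.sqrt(p))))
    w = max(1, int(round(T.shape[1] * np.sqrt(p))))
    return T[:h, :w].ravel()
```

For p = 0.25 this keeps half the rows and half the columns, i.e. a quarter of the coefficients, which shrinks the feature vector and every subsequent distance computation by the same factor.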
The system in [ASN11] exploits both the global and the regional features
of the images, extracted using the DCT and central moments. The idea is to
decouple the color components of the image using the YCbCr space, transform them
into DCT coefficients, and measure similarity on them; in addition, a
region-based shape descriptor is extracted by computing the central moments of
the image, with the help of edge information and morphological operations, to
obtain normalized feature vectors. The similarity is then calculated on the
combined features of quantized DCT color coefficients and normalized moment
feature vectors to retrieve images.
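Central moments themselves are standard and can be sketched directly; the sketch below shows why they are translation invariant (coordinates are measured from the intensity centroid) and how scale normalization is usually added. The edge and morphology stages of [ASN11] are omitted.

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of a grayscale image; translation invariant
    because coordinates are taken relative to the intensity centroid."""
    img = np.asarray(img, dtype=float)
    y, x = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()
    xbar = (x * img).sum() / m00
    ybar = (y * img).sum() / m00
    return (((x - xbar) ** p) * ((y - ybar) ** q) * img).sum()

def normalized_central_moment(img, p, q):
    """Scale-normalized moment eta_pq = mu_pq / mu_00^(1 + (p+q)/2)."""
    mu00 = central_moment(img, 0, 0)
    return central_moment(img, p, q) / mu00 ** (1 + (p + q) / 2)
```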
The existing research has mainly used the global features of an image for
image retrieval. The drawback of global features is that much of the detailed
information in the images is lost, for instance when looking for images that
contain the same object or the same scene viewed from different viewpoints. They
are therefore usually not suitable for partial image matching, or for object
recognition and retrieval in cluttered and complex scenes. To overcome this
problem, local features are used, which are extracted only from regions of
interest or objects in the image, without segmentation.
Well-known local features include SIFT [GL99], PCA-SIFT [KR04], GLOH
[MS05], SURF [HAT08], etc. Local features were initially proposed to solve
problems in computer vision applications such as wide-baseline matching [SZ02],
object recognition [FTV04], texture recognition [LSP03], robot localization
[LL02], video data mining [SZ03], building panoramas [BL03], and recognition of
object categories [DS03] [FPZ03]. In recent years, local features have been
employed in many CBIR systems.
2.4 Content-Based Image Retrieval using Scale Invariant Feature Transform
Xu Wangming et al. [WJX08] used SIFT in a CBIR system. In this system,
the visual contents of the query image and the database images are extracted and
described by 128-dimensional SIFT feature vectors. A KD-tree with best bin first
(BBF), an approximate nearest-neighbor (ANN) search algorithm, was used to index
and match the SIFT features. A modified voting scheme called nearest-neighbor
distance ratio scoring (NNDRS) was used to calculate the aggregate scores of the
corresponding candidate images in the database; after sorting the database
images by their aggregate scores in descending order, the top few similar images
were retrieved as results.
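The distance-ratio idea behind such scoring can be sketched with a brute-force matcher. This is a stand-in for the KD-tree/BBF search of [WJX08]: it applies Lowe's nearest-neighbor distance ratio test, accepting a match only when the nearest neighbor is clearly closer than the second nearest.

```python
import numpy as np

def nndr_match(query_desc, db_desc, ratio=0.8):
    """Match descriptors with the nearest-neighbor distance ratio test.

    query_desc: iterable of descriptor vectors; db_desc: (M, D) array.
    A query descriptor is matched to its nearest database descriptor
    only if d1 / d2 < ratio, where d1 and d2 are the distances to the
    first and second nearest neighbors. Returns (query_idx, db_idx)
    pairs.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        d = np.linalg.norm(db_desc - q, axis=1)
        i1, i2 = np.argsort(d)[:2]
        if d[i1] < ratio * d[i2]:
            matches.append((qi, int(i1)))
    return matches
```

The ratio test discards ambiguous matches: when two database descriptors are almost equally close, neither can be trusted, so no vote is cast for either candidate image.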
In another system, Sathya Bama et al. [SMRA11] used SIFT in a
computer-aided plant image retrieval method based on leaf images, using shape,
color, and texture features, aimed at the medical, botanical gardening, and
cosmetic industries. The HSV color space was used to extract the various color
features of the leaves, the Log-Gabor wavelet was applied to the input image for
texture feature extraction, and the Scale Invariant Feature Transform (SIFT) was
incorporated to extract the feature points of the leaf image.
Anil Balaji Gonde et al. [AMB11] used the Scale Invariant Feature
Transform (SIFT) for image retrieval. SIFT was used to extract the local
features of an image; such features can be extracted more accurately with SIFT
than with color, texture, shape, or spatial relations. The SIFT descriptor
vectors of each image were indexed using a vocabulary tree. Further, a relevance
feedback technique was used to bridge the gap between low-level features and
high-level concepts. The proposed method showed a significant improvement in
average precision and average recall rates.
2.5 Content-Based Image Retrieval using Histogram of Oriented Gradients
Histogram of Oriented Gradients (HOG) [DT05] is a feature descriptor used
in computer vision and image processing for the purpose of object detection. The
technique counts occurrences of gradient orientations in localized portions of
an image. It is similar to edge orientation histograms and scale-invariant
feature transform descriptors, but differs in that it is computed on a dense
grid of uniformly spaced cells and uses overlapping local contrast normalization
for improved accuracy.
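The core of the descriptor, the per-cell orientation histogram on a dense grid, can be sketched as below. This is a simplified illustration of [DT05]: it uses unsigned orientations and hard bin assignment, and omits the block-level contrast normalization mentioned above.

```python
import numpy as np

def hog_cells(img, cell=8, n_bins=9):
    """Gradient-orientation histograms on a dense grid of cells.

    Each pixel votes its gradient magnitude into one of n_bins unsigned
    orientation bins (0-180 degrees) of the cell it falls in.
    Returns an array of shape (rows_of_cells, cols_of_cells, n_bins).
    """
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    h, w = (s // cell for s in img.shape)
    hists = np.zeros((h, w, n_bins))
    for i in range(h):
        for j in range(w):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            bins = (a / (180.0 / n_bins)).astype(int) % n_bins
            for b, wgt in zip(bins, m):
                hists[i, j, b] += wgt
    return hists
```

On a horizontal intensity ramp, every pixel has a purely horizontal gradient, so all the weight falls into the first orientation bin of each cell.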
Megha Agarwal et al. [MM10] applied the HOG feature descriptor to
Content-Based Image Retrieval (CBIR). To measure similarity over a large
database, a vocabulary tree was used. They performed a comparative analysis of
retrieval systems based on the HOG feature descriptor and on the Gabor transform
feature descriptor: the HOG-based retrieval system improved the Average
Precision (AP) and Average Recall (AR) (56.75% and 38.45%, respectively) over
the Gabor-transform-based retrieval system (41.20% and 25.41%, respectively).
The experiment was performed on the Corel 1000 natural image database.
2.6 Conclusion
In this chapter, the low-level global image features used in CBIR systems
were reviewed, along with the related work using local features for image
retrieval. It was found that some CBIR systems use local features for image
retrieval, and that local features are widely used in many computer vision
applications. The proposed systems are therefore implemented using local
features, since local features are robust and invariant to transformations.