Dottorato di Ricerca in Informatica
XIII ciclo
Università di Salerno
Compression and Indexing of Digital Images
Riccardo Distasi
December 19, 2001
Coordinatore: Prof. A. De Santis
Relatore: Prof. G. Tortora
Contents
Front Matter
    Title Page
    Table of Contents
    List of Figures
    List of Tables
Thanks!
1 Introduction
    1.1 Guided Tour of this Thesis
2 Speeding Up Fractal Coding: Split Decision Functions
    2.1 Background
    2.2 Split-Decision Functions
    2.3 Experimental Results
    2.4 Conclusions
3 Fractal Indexing with Robust Extensions
    3.1 Background
    3.2 The Technique
        3.2.1 Basics of IFS image encoding
        3.2.2 The Index
    3.3 Invariance and Robustness
        3.3.1 Contrast Scaling
        3.3.2 Luminance Shifting
        3.3.3 Color Change
        3.3.4 Rotations and Reflections
    3.4 Experimental Results
    3.5 Conclusions
4 A Hierarchical Representation for Image Retrieval
    4.1 Background
    4.2 HER
        4.2.1 Computing Time Evaluation
    4.3 HER for Contours
        4.3.1 Properties of HER
    4.4 HER for Textures
        4.4.1 Invariance Properties
        4.4.2 Experimental Results
        4.4.3 Comparison with a Wavelet Based Method
    4.5 Experimental Results
    4.6 Concluding Remarks
References
A Additional Details
    A.1 Fractal Index Invariance to Contrast Scaling
        A.1.1 Center of Mass
        A.1.2 Higher Deviates
    A.2 Fractal Index Invariance to Luminance Shifting
        A.2.1 Center of Mass
Index
List of Figures
2.1 Rate-distortion curves of the 512 × 512 lena image obtained using a 4-level quadtree
3.1 Obtaining the canonical form of an image. (a) Original image with key points highlighted; (b) Image flipped vertically; (c) Final image rotated into canonical form
3.2 Compression to SNR comparison between first and fire
3.3 AVRR comparison between first, fire and PicToSeek. The horizontal line represents ideal AVRR
3.4 Querying fire: contrast scaling invariance and robustness
3.5 Querying fire: luminance shifting combined with rotation and reflection
3.6 Querying fire: color change
3.7 Querying fire: rotations and reflections
3.8 Querying fire: elephants
3.9 Querying PicToSeek: elephants
4.1 A high level sketch of the her algorithm
4.2 Obtaining the her representation of a real-life 1-D input signal
4.3 Sampling a 2-D contour at fixed angle increments can destroy information
4.4 Sampling a 2-D contour pixel-by-pixel
4.5 The correspondence between a contour and its her representation
4.6 Approximation of a shape by the first M coefficients of its dft
4.7 Approximation of a shape by its M largest maxima using her
4.8 Converting a 2-D texture into a 1-D time series. (A): what the texture looks like; (B): the partition element (texture tile); (C): the spiral; (D): the resulting 1-D signal
4.9 Rotating the partition element yields different local maxima
4.10 A selection of 256 tiles from the complete texture database utilized for the experiments
4.11 Results of a sample query: Metal.0000 (Element #2 in Fig. 4.10)
4.12 Results of a sample query: Bark.0000 (Element #1 in Fig. 4.10)
4.13 Distances from Fabric.0001 (#36 in Fig. 4.10) to the closest 250 matches in the database
4.14 Relation of the energy fraction used to the number of maxima found (index size)
4.15 A graphical view of the outcome of texture-based retrieval performed on the extended Brodatz data set
4.16 An example of retrieval using heri
4.17 An example of retrieval using Euclidean Distance
4.18 An example of heri's ability to retrieve rotated versions of the query
4.19 An example of retrieval utilizing a moment-based technique
List of Tables
2.1 Speed-up achieved using adaptive entropy against rms on the 512 × 512 lena image
4.1 Invariance properties of her for contours
4.2 Invariance properties of her for textures
4.3 Tabular results of texture-based retrieval performed on the extended Brodatz data set
4.4 Quick comparison between hs and heat
4.5 Comparison between heri, Euclidean distance (ed) and a moment-based technique (mbt) in terms of normalized recall
4.6 Detailed comparative results for Euclidean Distance (ed), heri and moment-based technique (mbt)
Thanks!
I’d like to thank the people who, in one way or another, helped me to carry out my work.
First, a word of thanks is due to my advisor, prof. Genny Tortora, and to prof. Maurizio
Tucci. They make up the “there would be no thesis at all without them” category. They
gave me their trust, and I am very grateful for all the advice and encouragement they
provided. Next there is the “this thesis would be very different without them, and not for
the better” category: working with prof. Sergio Vitulano has been an enriching experience.
Several colleagues and friends also provided help in unique ways: in particular, the
long and frequent discussions with Michele Nappi were the nursery where many of the
ideas presented here were born or took shape.
Many more people helped, each in her or his own way. I cannot mention every single
name here, but you know who you are. Thanks.
Last but not least, a final word of thanks is due to my family, especially my parents.
Their constant support has been invaluable. They make up the most exclusive category:
“without them, there would be no thesis at all—and even the author would not be there.”
Chapter 1
Introduction
This thesis is about a topic that is very 'hot' today: the compression and indexing of
digital images. Memory prices are dropping sharply, and that is even more true for
mass storage. When we additionally consider the effective compression techniques
available today, the cost of storing large databases of multimedia data, and
images in particular, has never been so low. However, as database size increases, the
main problem we face today is that of retrieval.
If we throw the Internet into the pot, we have even more examples of a large body of
image data which is accessible but not necessarily available to everyone, precisely because
the cost of effectively searching for a specific image is so high as to make retrieval nearly
impossible: in the worst case, we have to wade manually through a huge amount of data
obtained from some ‘image search’ engine.
In addition, it is important to devise systems that are usable by the general public.
This might seem obvious, but in fact many of the best working image indexing systems are
research prototypes, and as such have many parameters that must be accurately ‘tweaked’
in order to obtain the best results. Usability is a major goal, and not a trivial one at that.
For these reasons, finding more efficient ways to compress, index, and retrieve images
has become an active research field. Many researchers are investing their energies into the
quest of particular and general solutions to this important problem.
1.1 Guided Tour of this Thesis
Fractal image compression has proven to be a very effective way of achieving high compres-
sion ratios with very little distortion. The main problem with fractal-based techniques is
the long computing time required. Therefore, several acceleration methods are being
investigated. The careful choice of a split-decision function is one such method. Chapter 2
describes one possible choice, which is shown to achieve a significant speedup.
While being a useful compression technique, fractal coding can be successfully used to
index a database of images. The very same features that can describe an image in such
a way to achieve good compression—basically spatial relations among different image
regions—can be processed into a usable index. The resulting indexing system has inter-
esting properties: in particular, several useful invariances and the fact that the database
can be kept in (fractal-)compressed form throughout the whole process. This is addressed
in Chapter 3. The features that are relevant for fractal compression and indexing are not
easy for the human eye to see: self-similarity occurs at too small a scale, and the match
for the 'best' self-similarity (best in the rms sense) is computed with a least-squares best
fit whose results are often not apparent to the eye.
Another approach to indexing an image database, quite differently, focuses on specific
human-perceivable features, especially object contours and textures. Chapter 4 describes
a representation for 2-d data such as contours and textures that, besides having many
desirable invariance properties, has the nice characteristic of being reasonably predictable
in its operation. As a consequence, the human operator who is not an expert in the field of
image processing can easily relate to the method, and the resulting usability of the system
is significantly enhanced.
Chapter 2
Speeding Up Fractal Coding: Split Decision
Functions
2.1 Background
In fractal image compression, an image is partitioned into a set of image blocks called
ranges. A pool of larger image blocks called domains is used as a codebook from which
ranges are approximated with affine mappings of the intensity value. It is common practice
to use square blocks for both ranges and domains, as well as to enlarge the codebook by
including all rotations and reflections of each domain. For further details, see [21] and [40].
Bit rates and image quality are strongly related to range block size: large ranges are
hard to approximate but lead to low bit rates, while small ranges are easily approximated
but yield higher bit rates.
In order to achieve both good fidelity and low bit rates, a possible solution employs
variable size partitions that tune range size to the complexity of the different image areas.
The most popular partition mechanism uses a quadtree scheme [21]: a square block is
recursively broken up in four quadrants until the resulting blocks are considered sufficiently
simple to be approximated by a domain chosen from the codebook. This decision is
usually taken by comparing the value of a function of the block—the so-called split-decision
function—against a given threshold. The choice of both function and threshold affects the
fidelity of the reconstructed image as well as the coding time.
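The recursive quadtree partitioning described above can be sketched as follows. This is a minimal illustration rather than the thesis code: the function and variable names are ours, and `split_decision` stands in for any of the concrete criteria discussed in the next section.

```python
import numpy as np

def quadtree_partition(block, level, max_level, threshold, split_decision):
    """Recursively partition `block` until it is simple enough to encode.

    `split_decision` maps a block to a scalar; if the value exceeds
    `threshold` (and further subdivision is still allowed), the block is
    broken into its four quadrants. Returns the list of leaf ranges.
    """
    n = block.shape[0]
    if level < max_level and split_decision(block) > threshold:
        h = n // 2
        leaves = []
        for i in (0, h):
            for j in (0, h):
                leaves += quadtree_partition(block[i:i + h, j:j + h],
                                             level + 1, max_level,
                                             threshold, split_decision)
        return leaves
    return [block]

# Example: plain variance as the split-decision function on a toy image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(float)
ranges = quadtree_partition(image, 0, 3, 100.0, np.var)
```

An adaptive threshold scheme would simply pass a per-level threshold instead of the fixed scalar used here.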
We now compare the most widespread split-decision functions and propose a new
split-decision function based on entropy, also addressing the choice of the threshold.
2.2 Split-Decision Functions
The aim of a fractal coder is to minimise the root-mean-square (rms) error between
the original and the transformed image. To this aim, the natural way of driving the
partition process is that of adopting rms error as the splitting function. The classical
split-decision function computes the rms error between the range R and the optimal
transformed domain D∗:
\[
S_1(R) = \mathrm{rms}(R, D^*) = \frac{\|R - D^*\|}{\sqrt{n}},
\]
where n is the area in pixels. Using the function S1 implies that, before subdividing, every
attempt is made to encode bigger ranges, thus leading to optimal rate-distortion curves.
On the other hand, because of the unsuccessful attempts, a lot of computation is wasted.
To avoid this, it would be better to use a function which can be computed before even
trying to encode the current range block.
In [41], the authors consider a splitting function called n-fold variance that for a generic
block B (range or domain) is defined as
\[
S_2(B) = n\,\mathrm{Var}(B) = \sum_{k=1}^{n} \left(B_k - \mu(B)\right)^2,
\]
where µ(B) is the mean value of the intensities B1, . . . , Bn. This choice not only accelerates
the encoding process but also outperforms S1 in terms of rate-distortion curves. This might
sound strange, as S1 should be optimal from the point of view of fidelity. The explanation
is that S2 is an adaptive criterion, while S1 is not. In fact, the n-fold variance S2 takes
into account the size of the block: on the average, its value on a given block is 4 times
bigger than on a block of half its linear size. If the threshold is fixed for all levels of the
quadtree, then the subdivision of bigger blocks—which are more difficult to approximate
accurately—is favoured over that of smaller ones. The overall effect is a better rate-
distortion curve.
Using n-fold variance with a fixed threshold is equivalent to the use of standard variance
\[
S_3(B) = \mathrm{Var}(B) = \frac{1}{n} \sum_{k=1}^{n} \left(B_k - \mu(B)\right)^2
\]
with an adaptive threshold Ti = 4Ti−1, where Ti is the threshold at level i of the quadtree.
The fact that adaptive thresholds improve image quality has also been observed in [8] for
rms-based split-decision functions.
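This equivalence is easy to check numerically. The sketch below is illustrative (not from the thesis); it verifies on a random block that comparing the n-fold variance against a fixed threshold T gives the same split decision as comparing the plain variance against T/n.

```python
import numpy as np

def n_fold_variance(block):
    """S2: the n-fold variance n*Var(B), i.e. the sum of squared
    deviations of the pixel intensities from the block mean."""
    return float(np.sum((block - block.mean()) ** 2))

rng = np.random.default_rng(1)
B = rng.normal(size=(16, 16))

# Comparing S2 against a fixed threshold T is the same as comparing the
# plain variance S3 against the level-adaptive threshold T / n, where n
# (the pixel count) shrinks by a factor of 4 per quadtree level.
T = 50.0
same_decision = (n_fold_variance(B) > T) == (np.var(B) > T / B.size)
```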
Chapter 2. Speeding Up Fractal Coding: Split Decision Functions 5
Entropy is another interesting splitting function. For an 8-bit gray scale block R, it is
defined as
\[
S_4(R) = H(R) = -\sum_{k=0}^{255} f_k \lg(f_k),
\]
where fk is the frequency within R of the intensity value k. Like variance, entropy saves
computing time by avoiding computation for unsuccessful attempts.
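As an illustrative sketch (the helper name is ours, not the thesis's), the entropy criterion can be computed directly from the block's intensity histogram; terms with zero frequency contribute nothing:

```python
import numpy as np

def block_entropy(block):
    """S4: Shannon entropy (base 2) of the block's 8-bit intensity
    histogram; terms with f_k = 0 are taken as contributing 0."""
    hist = np.bincount(block.astype(np.uint8).ravel(), minlength=256)
    f = hist / hist.sum()
    nz = f[f > 0]
    return float(-np.sum(nz * np.log2(nz)))

flat = np.full((8, 8), 128)           # a single intensity: entropy 0
varied = np.arange(64).reshape(8, 8)  # 64 distinct intensities: lg 64 = 6
```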
Regarding threshold adaptivity, there seems to be a close correlation between the
split-decision function scaling factor on adjacent levels of the quadtree and the optimal
thresholds on those levels. In fact, for the variance criterion, Ti = 4Ti−1 yields the best
rate-distortion curve, while for rms the best curve is obtained by Ti = 2Ti−1.
For entropy, things are a bit different. Entropy is a convex function that reaches its
maximum when the frequency distribution is uniform. Since we are dealing with 8-bit
images, with 256 intensity values, the distribution inside a small block (e.g., 8 × 8) is likely to be noticeably less uniform than the distribution in a larger block (e.g., 32 × 32): the number of possible pixel values is comparable to the number of pixels in a block. This
means that larger blocks tend to have higher entropy; entropy is thus intrinsically adaptive.
Furthermore, being a logarithmic function, it does not have a scaling factor. For all these
reasons, the optimal threshold on level i should depend not only on that on level i − 1,but also on the size of the block itself—or equivalently on i. In fact, we experimentally
obtained the best curve using the threshold relation Ti = Ti−1 + 1/i. However, both
entropy and variance share the difficulty of tuning the threshold value precisely as to
obtain the desired degree of fidelity.
2.3 Experimental Results
Experiments were made using 4-level quadtrees, with range block linear sizes of 32, 16, 8
and 4 pixels, on the 512×512 lena image. In order to give no advantage to any particular
method, we did not employ any lossy speed-up technique such as block classification.
Furthermore, we considered codebooks of the same size for all levels of the quadtree.
Indeed, when the codebook size increases with the block size, rms has a relative advantage;
when codebook size decreases, the advantage is on entropy’s and variance’s side.
Table 2.1 shows the speed-ups achieved using adaptive entropy against adaptive rms.
Each row shows an encoding of lena at about the same compression ratio (CR) and
Table 2.1: Speed-up achieved using adaptive entropy against rms on the 512 × 512 lena image

  CR   Entropy snr   rms snr   Speed-up
  12      34.6         35.2      2.28
  35      30.1         30.9      1.95
  50      29.1         29.7      1.83
quality. Depending on the rate, the speed-up ranges from 2.28 to 1.83. Note that the
speed-up increases at lower bit rates; this reflects the fact that more computations are
avoided in the first levels of the quadtree.
Figure 2.1 compares the rate-distortion curves of the three analyzed splitting functions.
For each method there are two curves: one is obtained with a fixed threshold, while
the other is obtained with the best adaptive threshold relation. As expected, adaptive
rms is optimal, yielding the best rate-distortion curve. However, fixed-threshold entropy
outperforms standard rms because of its intrinsic adaptivity. Adaptive variance and
adaptive entropy are comparable and very close to the optimum—that is, adaptive rms.
From the plot it can be seen that adaptive thresholds provide a definite improvement
independently from the chosen splitting-decision function.
The main difference between variance and entropy is in the kind of noise added to the
reconstructed image: at low bit rates, variance causes low-contrast (“flat”) zones to get
coded poorly; by contrast, with entropy it is high-contrast (sharp-edged) zones that get
coded poorly. The latter kind of degradation is significantly less noticeable to the eye.
Comparing the two curves for each method in Fig. 2.1, we observe a dramatic improvement brought by adaptive thresholds. This is even more noticeable from a visual point of
view. In fact, adaptive thresholds yield a visual quality far better than fixed ones. This
improvement is basically due to the almost complete lack of blockiness.
2.4 Conclusions
In this chapter, we compared several split-decision functions for quadtree-based fractal
image coders. In addition, the effect of using adaptive thresholds on the various levels of
Figure 2.1: Rate-distortion curves of the 512 × 512 lena image obtained using a 4-level quadtree. (Axes: PSNR (dB) vs. compression ratio; curves: entropy, adaptive entropy, RMS, adaptive RMS, variance, adaptive variance.)
the quadtree has been investigated. The experiments show that adaptivity yields far
better rate-distortion curves and attenuates the effect of blockiness. However, it should
be pointed out that in the rms-error based criterion the threshold for obtaining a given
image quality is independent of the particular image. That is, for a given threshold
value we obtain almost the same visual image quality for every image. Unfortunately, this
is not the case with entropy- or variance-based criteria. This makes it difficult to obtain
an a priori estimate of the threshold value for encoding a given image at a given quality
level.
Chapter 3
Fractal Indexing with Robust Extensions
3.1 Background
Recently, the diffusion of multimedia computing systems has aroused significant interest
in research on multimedia database management. In particular, searching image
databases is a complex issue. Desirable performance characteristics of an ideal indexing
system include precise retrieval, small index size and easy computability. Currently, there
are several solutions to this problem, most of them tailored to a specific field of application.
As a broad distinction, homogeneous databases—such as those required by biomedical
applications—are characterized by having very small differences among the images (think
of an archive of liver CT-scans); the most effective approaches to date are based on object
contour shapes and spatial relationships among them [10]. The most popular tools of
the trade are Attributed Relational Graphs [36] and 2-D Strings [7, 9, 32]. Images in
heterogeneous databases, on the other hand, can be represented by coarser global features,
such as texture or color percentage [20,22]. Some systems combine the two approaches to
restrict the query answer set. A survey of content-based indexing systems can be found
in [12].
Fractal-based encoding—also called IFS-based encoding from the Iterated Function
Systems that are at its foundations—has already proven to be a reliable technique that
exploits the self similarity present in an image by representing it as a collection of affine
contractive transformations [21]. In much the same way, it is possible to utilize the same
collection of transformations as low-level features which can be organized into a signa-
ture that identifies the image and allows it to be retrieved. As a result, fractal encod-
ing is able to provide an effective indexing technique for heterogeneous image databases.
There is an added benefit: the whole database can be searched and manipulated with-
out ever decompressing the images. This ability makes fractal-based indexing suitable
to the handling of large databases. The efficiency of these systems—as opposed to their
effectiveness—depends heavily on the compression speed and also on how the low level
features extracted from the images are organized into indices. The techniques available to
avoid a linear search of the database include spatial access methods such as K-d-trees and
R*-trees [3, 4, 46] and general-purpose methods such as hash tables.
In particular, the image indexing technique called first [33] is based on fractal en-
coding. The features used for indexing are mainly the histograms of the most salient
contractive transformation parameters. These are organized in a spatial access structure
by means of R*-trees. The first system does reasonably well under the aforementioned
performance aspects and also provides good image compression ‘for free.’
However, it is often desirable to have an indexing system that is invariant—or at least
robust—to several image transformations: geometrical transformations such as changes
in the viewpoint or in the orientation of the object; pixel intensity transformations such
as those produced by a change in the illumination or in the sensitivity of the medium;
transformations due to transmission, such as added noise, and so on. first, while being
quite accurate and stable as far as image retrieval is considered, was not invariant to any
such transformation in a precise sense.
This chapter shows how a technique originally designed to speed up fractal compres-
sion [37] can be effectively employed to obtain desirable features (i.e., invariance) in a
fractal indexing retrieval system. The new system is an heir of first that significantly
improves on its efficiency and robustness, providing invariance under a large class of pixel
intensity transformations as well as isometrical geometrical transformations. As we shall
see, this is accomplished by modifying the relevant set of indexed features, removing irrel-
evant data and representing relevant data in a more abstract way. The experiments show
that these modifications in the index structure, besides providing the desired invariances,
do not cause any performance degradation; rather, the index is smaller and the quality
of retrieval is increased. The new system is called fire (Fractal Indexing with Robust
Extensions).
The chapter is organized as follows: Section 3.2 introduces the necessary concepts of
IFS-based encoding and then explains how the fire index is created; Section 3.3 then
discusses the invariance properties of the technique, providing proof sketches. Finally,
the technique is put in context: a few experimental results are shown in Section 3.4.
These results show fire’s improvement over its parent first and compare both with the
most similar (feature-wise) state-of-the-art method. Sections A.1 and A.2 in the appendix
contain a few details to complete the proofs of invariance that are sketched in this main
text.
3.2 The Technique
In order for this discussion to be self-contained, we briefly review the relevant concepts of
IFS encoding before explaining the details of index construction.
3.2.1 Basics of IFS image encoding
The image to be encoded is partitioned into a set R of blocks called ranges. Another
set D of image blocks called domains is used as a codebook to approximate each range by
means of an affine transformation. The domains, with sides twice as long as the ranges,
are shrunken to range size by pixel averaging. For each range r ∈ R, we have to find the domain d and two real numbers α and β that give
\[
\min_{d \in \mathcal{D}} \left\{ \min_{\alpha,\beta} \, \bigl\| r - (\alpha d + \beta \mathbf{1}) \bigr\| \right\}, \tag{3.1}
\]
where 1 is a constant block of intensity 1. This minimizes the root-mean-square (rms)
error in the approximation of r by an affine image of d:
\[
r \approx \alpha d + \beta \mathbf{1}. \tag{3.2}
\]
The inner minimum in Eq. (3.1) can be computed directly as the solution to a least squares
problem:
\[
\alpha = \frac{\sum_{1 \le i,j \le n} (r_{i,j} - \bar{r})(d_{i,j} - \bar{d})}
             {\sum_{1 \le i,j \le n} (d_{i,j} - \bar{d})^2}; \tag{3.3}
\]
\[
\beta = \bar{r} - \alpha \bar{d}, \tag{3.4}
\]
where n is the side length of both range and shrunken domain, while \(\bar{r} = \sum_{1 \le i,j \le n} r_{i,j}/n^2\) and \(\bar{d} = \sum_{1 \le i,j \le n} d_{i,j}/n^2\) are the average intensities in r and d respectively. Computation of
the outer minimum, however, is rather heavy if D is to be exhaustively searched. In order
to reduce the search space, the blocks (both ranges and domains) undergo a classification
process that yields a feature vector for each block [37]. Later, when approximating a
range r, we only search a domain class centered on r’s feature vector. The spatial access
method used for this search is based on R*-trees.
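The inner least-squares minimum of Eqs. (3.3)–(3.4) can be sketched as follows. This is an illustrative numpy version; `optimal_affine` is our name, not the thesis's, and the test data is synthetic:

```python
import numpy as np

def optimal_affine(r, d):
    """Least-squares solution of Eqs. (3.3)-(3.4): alpha is the
    covariance of r and d divided by the variance of d; beta then
    follows from the mean intensities."""
    r_bar, d_bar = r.mean(), d.mean()
    denom = np.sum((d - d_bar) ** 2)
    alpha = np.sum((r - r_bar) * (d - d_bar)) / denom if denom else 0.0
    beta = r_bar - alpha * d_bar
    return alpha, beta

rng = np.random.default_rng(2)
d = rng.normal(loc=100.0, scale=20.0, size=(8, 8))
r = 0.7 * d + 30.0          # a range that is exactly an affine image of d
alpha, beta = optimal_affine(r, d)
```

Since `r` is constructed as an exact affine image of `d`, the fit recovers α = 0.7 and β = 30 up to floating-point error.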
How are feature vectors computed? Given an n × n block b, we set k ← 0, b^{(k)} ← b, and we draw a similarity between pixel intensity and mass to compute the block's mass center coordinates:
\[
x_k = \frac{1}{M_k} \sum_{1 \le i,j \le n} i\, b^{(k)}_{i,j}; \qquad
y_k = \frac{1}{M_k} \sum_{1 \le i,j \le n} j\, b^{(k)}_{i,j}, \tag{3.5}
\]
where \(M_k = \sum_{1 \le i,j \le n} b^{(k)}_{i,j}\) is the block's mass.
We then consider the 'deviate' block given by
\[
b^{(k+1)}_{i,j} = \left(b^{(k)}_{i,j} - \mu_k\right)^2, \tag{3.6}
\]
where \(\mu_k = M_k/n^2\) is the average mass per pixel in \(b^{(k)}\). We set k ← k + 1 and go back to
Eq. (3.5) to compute the new mass center coordinates. This procedure can be carried out
an arbitrary number of times, but experiments have shown that twice (k = 0 and k = 1)
is enough to characterize the blocks adequately [14].
The resulting points (x0, y0) and (x1, y1) are then expressed in a polar coordinate
system whose origin lies in the block’s center, yielding (ρ0, θ0) and (ρ1, θ1); after discarding
ρ0 and ρ1, we are left with the final feature vector (θ0, θ1).
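The classification procedure can be sketched as follows. This is a simplified illustration under our own conventions (1-based pixel coordinates, angles measured with `arctan2` about the block center), not the thesis implementation:

```python
import numpy as np

def feature_vector(b):
    """Sketch of the classification features (theta0, theta1): the polar
    angles, about the block center, of the mass center of the block and
    of its squared-deviate block (Eqs. 3.5-3.6)."""
    n = b.shape[0]
    i, j = np.mgrid[1:n + 1, 1:n + 1]   # 1-based pixel coordinates
    center = (n + 1) / 2.0              # the block's geometric center
    thetas = []
    for _ in range(2):                  # k = 0 and k = 1 suffice [14]
        M = b.sum()                     # the block's 'mass' M_k
        x = (i * b).sum() / M           # mass center, Eq. (3.5)
        y = (j * b).sum() / M
        thetas.append(float(np.arctan2(y - center, x - center)))
        b = (b - M / n ** 2) ** 2       # deviate block, Eq. (3.6)
    return tuple(thetas)

block = np.zeros((8, 8))
block[0, 0] = 1.0                       # all mass in one corner
theta0, theta1 = feature_vector(block)
```

With all mass in the upper-left corner, both mass centers lie on the block diagonal, so both angles point toward that corner.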
Once each range in R has been approximated by a domain in D, the image can be
encoded by the affine mappings from domains to ranges. The reconstruction of the original
image A can be accomplished by the iteration of this system of transformations, since it
converges to A regardless of the starting data [28].
A further shortcut can be taken for multichannel images in YIQ form: by exploiting
the human visual system’s peculiarities, it is possible to encode the luminance channel
with maximum accuracy and then utilize the same range-domain correspondence also for
the encoding of the chrominance channels. In other words, referring to Eq. (3.1), once
the optimal d has been fixed by the luminance encoding, for the remaining chrominance
channels we only recompute the inner minimum over α and β: the hard outer minimum
over d ∈ D is computed only once. This yields a definite saving in both encoding time
and compressed length with a slight loss in terms of distortion.
In fractal encoding, it is customary to enlarge the domain pool by considering all
reflections and rotations of the domains along with the original versions. The fire system
does utilize these block isometries for encoding, but not for indexing.
Another concept that is widely used in plain fractal encoding is that of variable, dy-
namic partitioning: the set R usually grows larger as a candidate range is seen to be
best approximated by being recursively divided. The whole process yields a hierarchical
partition, usually in the form of a quadtree [39]. However, as shall be seen later, fire
does not utilize dynamic partitioning, for the sake of invariance to contrast scaling.
3.2.2 The Index
Once the image has been encoded, the index is obtained from the data gathered from the
encoding phase.
The index has a hierarchical structure, divided into two levels. The first level has to
do with the class distribution of the ranges composing the image, while the second level
depends on the specific affine mappings that encode them.
In order to classify the ranges, we quantize their (θ0, θ1) feature vector. The quan-
tized feature vectors are then histogrammed over all ranges, giving the first level of the
index. By tuning the number of bits utilized for quantization, it is possible to decide
how many classes there will be, and therefore control the size of the histogram. This is
the first important difference between fire and first: the previous system employed a
grid-quantized representation of the center of mass for each range rather than the polar
angle.
The second level of the index consists of another range-by-range histogram: that of the
(P(d), α) vector that represents the affine mapping utilized for the range’s encoding. The
domain is represented by the pixel position P(d) of its upper left corner, while α is the value
resulting from Eq. (3.3), appropriately quantized. This is another important difference
between fire and first: the latter utilized a 3-component vector including P(d), α and β,
while the new system only utilizes P(d) and α. As shall be seen, this is an improvement
over the previous version: the absence of β from the index, along with the adoption of
static partitioning, ensures index invariance under several pixel transformations to be
examined in the next section.
The two histograms would be too big to be a usable data structure, especially for large
databases; for this reason, the whole image index undergoes a discrete Fourier transform,
after which only the lower frequency coefficients are kept—typically 3 coefficients have
been empirically verified to be enough [1]. This operation significantly reduces the size of
the index.
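The truncation step can be sketched as follows, assuming a real DFT and Euclidean distance between the retained coefficients (the function names are invented for illustration):

```python
import numpy as np

def compact_index(hist, n_coeff=3):
    """Keep only the lowest-frequency DFT coefficients of an index
    histogram; 3 coefficients are the empirical choice cited in the text."""
    return np.fft.rfft(np.asarray(hist, dtype=float))[:n_coeff]

def index_distance(c1, c2):
    # Euclidean distance between truncated spectra (a sketch assumption)
    return np.linalg.norm(c1 - c2)
```

A 64-bin histogram thus shrinks to 3 complex coefficients, regardless of the number of classes.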
Multichannel images—either YIQ or RGB—can be handled in two ways: we can either
have a distinct index for each channel or we may pick a ‘privileged’ channel to have a full
index and condense the indices for the remaining two channels.
In the case of YIQ images, the privileged channel is luminance: the index for the
luminance component has its full two levels, while the I and Q channels have the second
level only. This choice is motivated by two considerations:
• the necessity of keeping the size of index data reasonably small; the choice of luminance
as the privileged channel is justified by the nature of the human visual system,
which is much more sensitive to luminance information than it is to chrominance;

• the inner workings of the encoding phase, which, as stated in the previous section,
has been designed to incorporate a similar shortcut: when dealing with 3-channel
images in the ‘cheaper’ way, the optimal domain for each target range is selected on
the basis of luminance information only.
RGB images are usually handled by having three separate full indices. However, with
a linear combination of the three channels, it is possible to build a luminance channel ‘on
the fly’ to be used as a privileged component and then choose arbitrarily two components
among R, G and B, building only the second index level for these two. One of the original
3 channels is therefore discarded altogether, but it can be reconstructed from the combined
luminance and the remaining two original channels when the image is decoded.
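A sketch of building such a luminance channel; the BT.601-style weights are an assumption, since the text only specifies ‘a linear combination of the three channels’:

```python
import numpy as np

def synthetic_luminance(r, g, b):
    """Luminance built 'on the fly' from R, G, B. The 0.299/0.587/0.114
    weights are assumed (BT.601 luma); the thesis does not fix them."""
    r, g, b = (np.asarray(c, dtype=float) for c in (r, g, b))
    return 0.299 * r + 0.587 * g + 0.114 * b
```

Since the weights sum to 1, a gray pixel keeps its value, and any two of the original channels plus this luminance suffice to recover the third at decoding time.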
Examining how the index is calculated, note that identical encodings yield identical
indices. In other words, the index is entirely determined by the structure of the encoding
and contains no extra information. Therefore, any image transformation that leaves the
encoding unchanged obviously leaves the index unchanged. Furthermore, referring to the
affine approximation in (3.2), any variation in the β parameters alone, while changing
the reconstructed image, has no effect on the index.
3.3 Invariance and Robustness
Without loss of generality, let us consider a true-color (24-bit) n×n image A. Its red, green
and blue components shall be denoted by AR, AG and AB. The image transformations
that we are about to examine include the following pixel value mappings:
• Contrast scaling. The transformed image is given by
A′ = wA, (3.7)
where w is a positive real number.
• Luminance shifting. The transformed image is given by

A′R = AR + mR1; A′G = AG + mG1; A′B = AB + mB1, (3.8)
where 1 is a suitably sized constant image of intensity 1, while mR, mG and mB are
arbitrary real numbers.
• Color change. The transformed image is given by

A′R = wRAR; A′G = wGAG; A′B = wBAB, (3.9)
where wR, wG and wB are positive real numbers.
Due to the standard representation of digital images, it should be observed that what
really gets into the transformed image is min(A′, 255) rather than simply A′. Conversely,
in the case of luminance shifting with a negative offset, the transformed image gets
max(0, A′). In other words, the final result is clipped to the interval [0, 255]. Our dis-
cussion assumes that the parameter values are in such a range that the transformations
have linear effects—i.e., that no clipping occurs. Indeed, in the case of severe clipping,
all image information is destroyed and there is no method that can ensure invariance:
any image would simply turn into a series of 255’s. On the other hand, if ‘mild’ clipping
occurs at a few points, fire retains its robustness: the transformed image is no longer at
distance 0 from the original, but the distance is small enough for the transformed image
to rank high in the answer set. Generally, the main reason why the user wants invariance
to pixel-value transformations is the ability to deal with small discrepancies in the image
acquisition devices (e.g., scanner calibration). Therefore, it makes sense to assume that
in most practical cases the actual parameter values are indeed reasonably small.
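The pixel value mappings in Eqs. (3.7)–(3.9), together with the clipping just discussed, can be sketched as:

```python
import numpy as np

def contrast_scale(img, w):
    """A' = wA (Eq. 3.7), clipped to [0, 255] as described in the text."""
    return np.clip(w * np.asarray(img, dtype=float), 0, 255)

def luminance_shift(img, m):
    """A' = A + m*1 (one channel of Eq. 3.8), clipped to [0, 255]."""
    return np.clip(np.asarray(img, dtype=float) + m, 0, 255)

# mild clipping: one pixel saturates, the rest transform linearly
out = contrast_scale([100, 200], 1.5)
```

When `w` or `m` is large enough to saturate most pixels, the linearity assumption above no longer holds and invariance degrades, as the text points out.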
In addition to pixel value mappings, we are interested in geometric image transforma-
tions:
• ‘Integer’ rotations by an angle ω multiple of π/2. Depending on the angle of rotation,
the transformed image is given by

A′i,j = Aj,n−i+1 when ω = π/2; (3.10)

A′i,j = An−i+1,n−j+1 when ω = π; (3.11)

A′i,j = An−j+1,i when ω = 3π/2. (3.12)
• Reflections. According to the type of reflection, the transformed image is given by

A′i,j = An−i+1,j (‘vertical flip’ about the horizontal axis); (3.13)

A′i,j = Ai,n−j+1 (‘horizontal flip’ about the vertical axis); (3.14)

A′i,j = Aj,i (‘NE/SW flip’ about the main diagonal); (3.15)

A′i,j = An−j+1,n−i+1 (‘NW/SE flip’ about the secondary diagonal). (3.16)
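With 0-based indices (the equations above are 1-based), these eight isometries map onto standard NumPy operations; a sketch:

```python
import numpy as np

A = np.arange(16).reshape(4, 4)    # toy n x n image
n = A.shape[0]

rot_90  = np.rot90(A, 1)           # Eq. (3.10): A'[i,j] = A[j, n-1-i]
rot_180 = np.rot90(A, 2)           # Eq. (3.11): A'[i,j] = A[n-1-i, n-1-j]
rot_270 = np.rot90(A, 3)           # Eq. (3.12): A'[i,j] = A[n-1-j, i]
v_flip  = np.flipud(A)             # Eq. (3.13): flip about the horizontal axis
h_flip  = np.fliplr(A)             # Eq. (3.14): flip about the vertical axis
ne_sw   = A.T                      # Eq. (3.15): flip about the main diagonal
nw_se   = np.rot90(A.T, 2)         # Eq. (3.16): flip about the secondary diagonal
```

Note that `np.rot90` rotates counterclockwise, which matches the orientation implied by Eq. (3.10).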
3.3.1 Contrast scaling
Let us examine what happens when the image A undergoes a contrast scaling transforma-
tion as in Eq. (3.7). The first thing to be noticed is that the positions of the blocks’ mass
centers do not change. In terms of the intensity/mass simile, what happens is that the
density changes by a factor w, but, since this happens uniformly over the whole image, the
values of x0 and y0 in Eq. (3.5) remain the same as they were before the transformation.
When k is increased for the calculation of x1 and y1 (and possibly higher-order devi-
ates), there is a further scaling of the density by a factor of w, but this does not affect
the resulting x and y coordinates. Therefore, the whole calculation of feature vectors is
unaffected.
As for the rest of the encoding process, multiplying all the blocks—both ranges and
domains—by the same scalar w > 0 simply means that the approximation in Eq. (3.2)
gets replaced by
r′ ≈ αd′ + β′1, (3.17)
where all the ‘primed’ quantities are multiplied by w. Of course, the rms error is also
scaled by w; however, the minimum error for a given range will be obtained with the same
domain and the same value of α as before, since all errors are scaled uniformly.
As a consequence, all ranges in the transformed image A′ will be encoded by the same
domains as in the original image A; in addition, the positions of the mass centers—and
therefore the (θ0, θ1) feature vectors—remain unchanged.
The bottom line is that the encodings of A and A′ differ only in their β offsets—in the
encoding of A′, they are all multiplied by w. However, β does not appear anywhere in the
index; what this implies is that the fire index is identical for A and A′, as desired.
It should be noted that the invariance of fire under contrast scaling transformations is
dependent on the adoption of a predetermined, fixed image partitioning: when using any
locally adaptive partitioning (such as the popular quadtree-based schemes), if w > 1 there
is no guarantee that the rms error in Eq. (3.17) cannot be improved upon by dividing the
range into smaller units. In this event, the whole encoding changes and so does the index.
3.3.2 Luminance Shifting
In the case of a luminance shifting transformation such as that described in Eq. (3.8),
the value of α that solves the least squares problem in (3.3) remains unaltered, but the
optimal value of β as given by Eq. (3.4) is affected by the appropriate additive constant
among those appearing in Eq. (3.8)—e.g., mR for the red channel and so on. As it happens
in the case of contrast scaling, however, the index is unaffected because β does not appear
in it.
Luminance shifting transformations also affect the position of the ranges’ centers of
mass, which might seem to affect the class histogram portion of the index. However,
recalling that the relation between the original block b and the transformed block b′ is
b′ = b+m1, it can be seen that the effect is that of moving the center of mass towards
the geometric center of the block. It is as if a slab of constant density were attached to
the bottom of the block. The heavier the slab gets, the more the center of mass moves
toward the geometric center.
What this means is that the angle θ is unchanged; it is the distance ρ that gets smaller.
Since only θ is used for classification and index construction, the technique is invariant
under luminance shifting. Unlike the case of contrast scaling, for this type of transformation
the invariance does not even depend on the adoption of a fixed-partitioning scheme.
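The slab argument can be checked numerically. The sketch below assumes the center of mass is measured from the block's geometric center, with (ρ, θ) its polar coordinates; adding a constant leaves θ untouched while shrinking ρ:

```python
import numpy as np

def center_of_mass_polar(b):
    """Polar coordinates (rho, theta) of a block's center of mass,
    relative to its geometric center (a sketch, not the thesis code)."""
    n = b.shape[0]
    ys, xs = np.mgrid[0:n, 0:n]
    m = b.sum()
    dx = (xs * b).sum() / m - (n - 1) / 2
    dy = (ys * b).sum() / m - (n - 1) / 2
    return np.hypot(dx, dy), np.arctan2(dy, dx)

b = np.full((8, 8), 10.0)
b[1, 6] += 50.0                               # an off-center bright spot
rho0, th0 = center_of_mass_polar(b)
rho1, th1 = center_of_mass_polar(b + 100.0)   # luminance shift
# theta is unchanged; rho has moved toward the geometric center
```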
3.3.3 Color Change
Examine a color change transformation such as the one described in Eq. (3.9). On a
channel-by-channel basis, this is mathematically identical to a contrast scaling, and therefore
the fire technique is invariant to color change just as it is to contrast scaling, as far
as single-channel 8-bit images are concerned.
However, in the case of multi-channel images, the technique stays invariant only as
long as each channel has its own index. In other words, it is not possible to condense
the indices for different channels into a single index while keeping fire invariant to color
change.
3.3.4 Rotations and Reflections
The first thing to be noticed is that any combination of reflections and integer rotations
can be obtained by a suitable subset of transformations that include the four rotations
and reflection about the vertical axis. For this reason, we shall restrict our attention to
these transformations.
The approach used in fire to handle rotated and reflected versions of the query image
is that of reducing all the images in the database to a ‘canonical form,’ that is, one
particular rotation and reflection among all the possibilities.
In order to choose a canonical form for the image A, we consider three key points: (1) Its
center of mass P = CM(A); (2) The center of mass of the deviate image Q = CM(A(1));
(3) The geometric center O.
The canonical form of an image is defined by the following two properties:
Figure 3.1: Obtaining the canonical form of an image. (a) Original image with key points
highlighted; (b) Image flipped vertically; (c) Final image rotated into canonical form.
1. the point P lies in the northwestern quadrant of the image.
2. the oriented angle ϕ = POQ satisfies 0 ≤ ϕ < π.
The first property selects the right rotation, while the second selects the right reflection.
Each image A, therefore, undergoes the following normalization procedure: first, Property 2
is enforced by flipping A if necessary; then, Property 1 is enforced by rotating A by
the appropriate multiple of π/2 so as to bring its center of mass P into the northwestern
quadrant.
This process is illustrated in Fig. 3.1.
As a result, all isometric variants of A yield the same index. Therefore, the fire index
is invariant under integer rotations and reflections.
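A sketch of this normalization; the deviate image A(1) is passed in as a function, since its definition is not repeated here, and the sign convention for the oriented angle ϕ is an assumption:

```python
import numpy as np

def center_of_mass(A):
    """Center of mass of image A in (x, y) pixel coordinates."""
    ys, xs = np.mgrid[0:A.shape[0], 0:A.shape[1]]
    m = A.sum()
    return np.array([(xs * A).sum() / m, (ys * A).sum() / m])

def canonical_form(A, deviate):
    """Sketch of the normalization in Fig. 3.1: flip to enforce
    0 <= phi < pi, then rotate until P lies in the NW quadrant."""
    n = A.shape[0]
    O = np.array([(n - 1) / 2, (n - 1) / 2])
    P = center_of_mass(A) - O
    Q = center_of_mass(deviate(A)) - O
    if P[0] * Q[1] - P[1] * Q[0] < 0:    # oriented angle POQ is negative
        A = np.fliplr(A)                 # Property 2: flip if necessary
    for _ in range(4):                   # Property 1: rotate into NW
        P = center_of_mass(A) - O
        if P[0] < 0 and P[1] < 0:        # NW: left of and above the center
            break
        A = np.rot90(A)
    return A
```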
3.4 Experimental Results
This section focuses on the improvements achieved by fire over its parent, first. Since
one of the characteristics of these two systems is that of providing both image compression
and image indexing, our tests have been directed to both aspects. Additionally, in order
to assess fire’s performance with respect to methods based on different theoretic foun-
dations, we also have compared fire to PicToSeek [23]. The choice fell on this method
because, similarly to fire, it has been designed with the purpose of providing several
invariance properties.
The fire system has been implemented as a Java program under Windows 98. The
hardware platform is a Pentium II based PC. All the tests have been performed on a
heterogeneous database of about 2000 256×256 images at 24 or 8 bits/pixel that includes
the following categories: tools, animals, human faces, CT-scans, NMRs, landscapes, art
images and pasta. The database has grown over time by accumulation and basically con-
sists of the Smeulders dataset [23] and the Sclaroff dataset [42,43], plus additional images
from our own test dataset. In order to verify the invariance and robustness properties
of the system, the database also contains several transformed versions of some of the im-
ages. The transformed versions have been obtained from the originals by applying various
combinations of pixel value mappings and rotation/reflection.
The quality of the encoding can be evaluated by plotting a measure of distortion against
a measure of compression. We utilize SNR (signal-to-noise ratio) as a measure of inverse
distortion and the bit rate in bits per pixel as a measure of inverse compression.
As can be seen from Fig. 3.2, there is an improvement in compression that derives from
fire’s better classification scheme: the earlier system utilized a modified grid method in
Cartesian coordinates instead of the present quantized polar coordinates. The (θ0, θ1)
feature vectors have been quantized to 6 bits, yielding 64 classes. The classification im-
provement even makes up for the weaker partitioning scheme of fire, which has abandoned
quadtree dynamic partitioning, resorting to a fixed partitioning scheme. The data shown
in Fig. 3.2 have been gathered by averaging the results over a representative selection
including about 20 images for each category in our database.
As for the discriminating ability, this also has improved significantly. It should not be
surprising that better compression goes along with better indexing: in fact, better com-
pression for equal SNR means that the internal self-similarity of the image is represented
more accurately; this allows it to be exploited more effectively also for indexing.
There are several measures of the accuracy with which images can be retrieved; the
one utilized in this chapter is defined below.
The averaged rank over the set of relevant images (AVRR) is just what its name
implies: given a query, consider an answer set S of predetermined size—say p. Then
the ideal answer set contains the p “best” images for that query, ranked therefore from
0 to p − 1. As is customary in image indexing, the ranks are subjectively assigned by an
external human operator with some experience in the field of application from which the
[Plot: PSNR (dB) versus bit rate (bpp); curves for first and fire.]
Figure 3.2: Compression to SNR comparison between first and fire.
images are drawn.
In real life, of course, the obtained ranks will be worse than optimal. Their average is
just the AVRR of the answer set S. In other words,

AVRR(S) = (1/p) ∑_{s∈S} rank(s), (3.18)
where S has cardinality p. The best AVRR is therefore the lowest. The lower bound for
AVRR is called ideal AVRR (IAVRR) and depends on p only:
IAVRR(p) = (1/p) ∑_{i=0}^{p−1} i = (p − 1)/2. (3.19)
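Eqs. (3.18) and (3.19) amount to the following, with ranks starting at 0:

```python
def avrr(ranks):
    """Average rank of the answer set, Eq. (3.18)."""
    return sum(ranks) / len(ranks)

def iavrr(p):
    """Ideal AVRR, Eq. (3.19): the mean of the ranks 0 .. p-1."""
    return (p - 1) / 2
```

A perfect answer set of size p reaches the ideal value: for instance, `avrr([0, 1, 2, 3])` equals `iavrr(4)`.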
Fig. 3.3 shows how the two systems perform as for AVRR. The results shown here have
been obtained by averaging 15 different queries over databases of several sizes with answer
set size p = 20.
These measures of accuracy should be interpreted with some caution, because they
depend heavily on the particular database being used for the tests. To make matters
worse, the researchers in this field have not yet agreed on a standard database to be
employed for this kind of tests. However, as database size increases, the measures get
more and more reliable in describing performance.
[Plot: AVRR versus database size; curves for first, fire, PicToSeek, and the IAVRR baseline.]
Figure 3.3: AVRR comparison between first, fire and PicToSeek. The horizontal line
represents ideal AVRR.
Fig. 3.4 shows the outcome of a query. The query image is in the upper left, while
the 11 best ranking images in the answer set follow. In this case, our database contained
8 transformed versions of the query image, and all are retrieved. The first retrieved images
are at distance 0 from the query; starting from the second image in the second row, the
distance is different from 0, since for these images the contrast has been increased beyond
the clipping point. From the third row on, the retrieved image is a different one altogether
(a bear instead of the monkey), along with a few transformed versions. Fig. 3.5 shows
invariance/robustness to luminance shifting combined with rotation and reflection, while
Fig. 3.6 shows color change.
A query returning mainly rotated and reflected versions of the query image is depicted
in Fig. 3.7. In this case, too, the first two rows return all the 0-distance matches in an
arbitrary order, while the last row shows the next best matches.
To get a feel for the different personalities of fire and PicToSeek, look at Figs. 3.8
and 3.9. Here, the two systems are queried with the same elephant image. As can be
seen, fire (Fig. 3.8) behaves as if it were color-blind and returns a 0-distance match first, even if
the colors and the orientation are radically different. Furthermore, notice how the first
three images in the second row are very similar to a reflection of the query image. On the
Figure 3.4: Querying fire: contrast scaling invariance and robustness.
Figure 3.5: Querying fire: luminance shifting combined with rotation and reflection.
Figure 3.6: Querying fire: color change.
Figure 3.7: Querying fire: rotations and reflections.
Figure 3.8: Querying fire: elephants.
other hand, PicToSeek (Fig. 3.9) with the ‘color invariant’ option selected is much more
attentive to hue values. As a result, all the returned images show a very similar cumulative
color distribution, which is not the case with fire. A ‘nearly-rotated’ version of the query
image does appear (first place in the fourth row), but not before several lions.
As the last example suggests, the indexing method that should be preferred is largely
dependent on the specific application and user needs, especially when considering that the
two systems have similar overall performance as measured by AVRR.
3.5 Conclusions
This chapter has illustrated a system for image indexing that is based on fractal image
coding. The images are always kept in compressed form, and in fact the encoding itself
provides the information that makes up the index. For this reason, the indexing method
proposed is suitable for use with large image databases.
Figure 3.9: Querying PicToSeek: elephants.
The system has been designed to be invariant to three classes of pixel intensity trans-
formations: contrast scaling, luminance shifting and color change, and to all isometric
transformations such as rotations, reflections and their combinations.
These features make the system robust with respect to a wide class of image distortions
that are likely to happen in real applications.
The experiments show that fire performs adequately to be employed with large image
databases: the compression it achieves with fractal encoding is nearly top notch and its
retrieval accuracy compares well with today’s standards.
Chapter 4
A Hierarchical Representation for Image Retrieval
4.1 Background
In the last few years, due to the steady progress of multimedia processing, the interest
of the scientific community in multimedia database systems has significantly grown. In
particular, the study of effective representations suitable for obtaining approximate retrieval
by content has received great attention [24,38].
Human beings are extremely good at recognizing objects independently of their position
and orientation. Finding a solution for the same problem in machine vision, however,
turns out to be a very complex and difficult task. The
main task of pattern recognition is that of comparing a measured image in an unknown
position to different prototypes. A direct brute-force solution to this problem compares
the prototypes in all possible positions and extracts the optimal match.
If we use Euclidean distance for comparison (which under certain assumptions yields a
maximum likelihood estimator), we end up calculating the maximum of a high order cor-
relation function, which is a rather time consuming operation. The time required grows
exponentially with the number of parameters describing the coordinate transformations
induced by the motion.
A more elegant way to solve the problem involves the use of mappings that are able to
extract position-invariant intrinsic features of the object. The method of Fourier descrip-
tors is known to work reasonably well for the recognition of object contours independent
of position, orientation and size [25]. There are works that show the results of the Fourier
approximation of polygons for different numbers of Fourier coefficients [49]. As it turns
out, it is possible to achieve a good approximation of a polygon by using 15–30 coefficients.
Even with few coefficients, the Fourier series yields an acceptable approximation to the
original curve, because the low frequencies contain the most significant information about
the object.
Other techniques resort to the minimization of the contour’s moments with respect to
an orthogonal coordinate system centered at the object’s center. Generally, only the first
two moments are used: as pointed out in [30], higher-order moments add little information
content. However, this approach does not appear to be particularly effective: indeed, it
requires a great amount of information and long computing times.
In this chapter, we present a novel time-series indexing system, heri (Hierarchical
Entropy-based Representation for Indexing), useful for efficient retrieval by content. heri
is based on her [15,19] (Hierarchical Entropy-based Representation), which is employed in
order to describe a 1-D signal by means of a few coefficients. As a matter of fact, effective
retrieval by content implies a representation of pictorial data that is approximate in order
to save computing time during retrieval, but still retains enough relevant information to
allow for a discriminating retrieval. Many techniques, therefore, compress information
deemed to be ‘relevant’ into a few coefficients.
her too falls into this category. The particular method used by her reconstructs the
energy distribution of the given signal along the independent variable axis. The focal point
of this technique is that we select the most relevant local maxima based on the area, and
therefore the energy, associated with each maximum. An interesting consequence is the
generality of the resulting representation. Indeed, heri is a good candidate for content
based retrieval whenever the information can be accurately represented by a 1-D signal.
The structure of the chapter is as follows. In Section 4.2, we present the theoretical
foundation underlying her and describe the correspondence between the input signal and
the representation vector. We also outline the link between the properties of an input shape
and the resulting coefficients. Moreover, the invariance properties of heri are discussed.
Section 4.5 shows the results of our experiments in terms of well-known objective measures.
Finally, Section 4.6 contains a few concluding remarks.
4.2 HER
her is based on a subset of the 1-D signal samples (namely, the signal’s local maxima)
and on their associated energy. The interesting peculiarity of this representation is that it
allows us to reconstruct the signal’s energy distribution along the time axis by using only
a few coefficients.
Let x(n) be a monodimensional signal, time-discrete and finite in the time domain, i.e.,
x(n) = 0 for n ∉ [0, N − 1]. Let us define the energy of the i-th sample as E(i) = |x(i)|².
The total energy of x(n) can then be defined as the sample-by-sample sum of the individual
energies:

E = ∑_{i=0}^{N−1} E(i) = ∑_{i=0}^{N−1} |x(i)|². (4.1)
Now, consider the difference operator ∆i = x(i) − x(i−1). A local signal maximum occurs
wherever ∆ turns negative:

x(k) is a local maximum ⇔ ∆k ≥ 0 and ∆k+1 < 0. (4.2)

If we have several adjacent samples x(k), . . . , x(k+ℓ) such that ∆k = · · · = ∆k+ℓ = 0,
then we have a signal plateau. In this case, if ∆k+ℓ+1 < 0, we pick the plateau
midpoint x(k + ℓ/2) as the local maximum.
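A minimal sketch of this maximum detector, including the plateau rule (boundary samples are ignored, an assumption of the sketch):

```python
def local_maxima(x):
    """Indices of the local maxima of a discrete signal, following
    Eq. (4.2): a rise followed by a fall; a plateau yields its midpoint."""
    maxima, i, n = [], 1, len(x)
    while i < n:
        if x[i] - x[i - 1] > 0:                    # delta_i > 0: rising edge
            j = i
            while j + 1 < n and x[j + 1] == x[j]:  # walk along a plateau
                j += 1
            if j + 1 < n and x[j + 1] < x[j]:      # falling edge after it
                maxima.append((i + j) // 2)        # midpoint of the plateau
            i = j + 1
        else:
            i += 1
    return maxima
```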
Suppose x(i) is a local maximum for the signal under study; we compute its relative
energy, weighted by the total signal energy, as

Er(i) = E(i)² / (E − E(i)). (4.3)
We now consider x(i) as the midpoint of a Gaussian distribution. The choice of as-
sociating a Gaussian function with each maximum has been made after considering the
following facts:
• Given a set P of points, for large |P|, if the number of local maxima m is such that
m ≪ |P|, then any distribution tends to a Gaussian.
• If the contour is corrupted by 0-mean Gaussian noise, then a Gaussian distribution
is obviously the best fit for the resulting distribution.
• The experiments have shown that choosing a distribution other than a Gaussian has
little effect on both the energy associated with each maximum and the final normalized
recall achieved.
• The symmetry of Gaussian distributions allows us to associate with each maximum,
in a natural way, the energy contained in an interval that has the maximum as its
midpoint.
The standard deviation of the Gaussian associated with the maximum at x(i) is then

σ(i) = 1 / (√(2π) (Er(i))²). (4.4)
We then calculate the entropy relative to the maximum x(i) as the quantity

S(i) = (1/x(i)) ∑_{k=−σ(i)}^{σ(i)} |x(i + k)|. (4.5)
S(i) can be considered as a relative measure of the signal energy in the range
[i − σ(i), i + σ(i)] with respect to the energy in x(i). If a signal has m maxima, its total
entropy is therefore

S = ∑_{i=1}^{m} S(i). (4.6)
We now consider a vector x containing the m maxima of x(i) in decreasing order:
x ≡ (x(i1), . . . , x(im)), with x(i1) ≥ x(i2) ≥ · · · ≥ x(im). (4.7)
Then, the representation y of the signal x is ultimately obtained as the union of all intervals
around the maxima appearing in x:

y = ⋃_{k=1}^{m} [x(ik − σk), x(ik + σk)]. (4.8)
The signal y is uniquely determined by the first m triples of the vector y, containing
all the maxima and their associated energy and defined as

y = ( i1, E(i1), [x(i1 − σ1), . . . , x(i1 + σ1)],
      . . . ,
      im, E(im), [x(im − σm), . . . , x(im + σm)],
      . . . ). (4.9)
The vector y is the her representation of the signal x. More formally, supposing we
have a time series x(·) with N points, let us define the energy of the i-th sample as
E(i) = |x(i)|². The total energy of x(·) is simply E = ∑_{i=0}^{N−1} E(i), while the relative
energy of x(i) is Er(i) = E(i)²/(E − E(i)). A summary of the whole process is presented
in Figure 4.1, which shows a high level description of the algorithm to obtain the her
form of a given signal.
An alternate form for Step H6 stops iterating when the fraction of the total energy
remaining in the signal x(·) falls below a given threshold. In most cases, the alternate test
offers more control on index accuracy at the expense of unpredictable index size. In order
to perform our tests with preset index sizes, the simpler ‘number of maxima’ test has been
preferred.
Another possibility is that, instead of calculating σ in Step H4, it can be treated as
a parameter and therefore set to the same fixed value for all the maxima. This has the
effect of establishing an overall minimum spacing distance between consecutive maxima.
Figure 4.2 shows a specific example of how the her representation is obtained starting
with a real-life input signal.
In order to perform an approximate comparison between two given signals x1(·) and x2(·),
we can compute the distance between their her representations y1(·) and y2(·) or,
equivalently, between the vectors y1 and y2, defined as follows:

D(y1, y2) = ∑_{i=0}^{∞} |y1i − y2i|. (4.10)
The distance between any signal and itself is obviously equal to zero.
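Assuming the triples are compared element-wise, Eq. (4.10) reduces to an L1 distance over the flattened representation vectors:

```python
def her_distance(y1, y2):
    """L1 distance between two HER vectors (Eq. 4.10); the triples are
    flattened and compared element-wise (a sketch assumption)."""
    f1 = [v for triple in y1 for v in triple]
    f2 = [v for triple in y2 for v in triple]
    return sum(abs(a - b) for a, b in zip(f1, f2))
```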
It should be stressed that her is not meant to exactly reproduce the input signal, as
would be appropriate for strict signal compression applications. Rather, her extracts a
number of features that can be used to retrieve signals similar to the one under consider-
ation. Even the case σ = 0 does not imply that every single signal point will be sampled
by her: the algorithm always limits itself to the signal maxima. If, on the other hand,
we select all the points (not only the maxima) and we define the instantaneous energy
density as
ρE(n, ℓ) = (1/(2ℓ)) ∑_{i=−ℓ}^{ℓ} x(n + i), (4.11)

then ρE does converge to the original signal:

lim_{ℓ→1} ρE(n, ℓ) = x(n) (in ℓ2). (4.12)
The HER Algorithm
Here is how the her representation vector y of the sequence x(i), 0 ≤ i ≤ N − 1 is obtained.
H1. [Initialize counter and output vector. Compute the total energy.]

k ← 0; y ← ( ); E = ∑_{i=0}^{N−1} |x(i)|²
H2. [Find the m signal maxima and put them in a queue Q in decreasing magnitude
order, along with their x-axis position and their distance from the first (largest)
maximum.]

Q ← ( (i1, 0, x(i1)), . . . , (im, dm, x(im)) );
H3. [Pop the largest maximum from Q.]

(t, x(t)) ← pop(Q);

[Compute its relative energy Er(t).]

E(t) = |x(t)|²; Er(t) = E(t)²/(E − E(t));
H4. [Compute the standard deviation relative to the current maximum.]

σ(t) = 1/(√(2π) (Er(t))²);

[In other words, we are considering x(t) to be the midpoint of a Gaussian
distribution. Now compute its relative entropy.]

S(t) = (1/x(t)) ∑_{i=−σ(t)}^{σ(t)} |x(t + i)|;
H5. [Append the newly found values to the her output vector y.]
y = y ⊗ (x(t), S(t), dt);
H6. [Go back to Step H3 until we have used a predefined number M of maxima.]
k ← k + 1;
If k < M go to Step H3, else output y.
Figure 4.1: A high level sketch of the her algorithm
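Steps H1–H6 can be sketched as follows; plateau handling and the distance bookkeeping of step H2 are simplified, and truncating σ to an integer interval radius is an assumption:

```python
import math

def her(x, M=3):
    """Simplified sketch of the HER algorithm of Fig. 4.1."""
    N = len(x)
    E = sum(v * v for v in x)                      # H1: total energy
    maxima = [i for i in range(1, N - 1)           # H2: local maxima ...
              if x[i] - x[i - 1] >= 0 and x[i + 1] - x[i] < 0]
    maxima.sort(key=lambda i: x[i], reverse=True)  # ... by decreasing magnitude
    y = []
    for t in maxima[:M]:                           # H6: stop after M maxima
        Et = x[t] * x[t]                           # H3: relative energy
        Er = Et * Et / (E - Et)
        sigma = 1.0 / (math.sqrt(2 * math.pi) * Er * Er)   # H4
        r = int(sigma)
        lo, hi = max(0, t - r), min(N - 1, t + r)
        S = sum(abs(x[j]) for j in range(lo, hi + 1)) / x[t]
        y.append((x[t], S, t - maxima[0]))         # H5: triple (x(t), S(t), d_t)
    return y
```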
Figure 4.2: Obtaining the her representation of a real-life 1-D input signal
4.2.1 Computing Time Evaluation
A cursory analysis of the algorithm shows that its time complexity is basically linear in
the number of points N . More precisely, let us consider an N -pixel signal. Assuming that
the number of relevant maxima is m (typically m� N), Step H1 requires constant time,
while Step H2 can be carried out by performing one sequential scan of the N -pixel input
and sorting the local maxima—that is, time c1N + c2m logm. The loop including Steps
H3 through H6 is executed m times. Steps H3, H5 and H6 require constant time, while
Step H4 takes a time proportional to σ(t), which has N as a tight upper bound—for all
practical purposes, however, σ(t) can be considered as a constant. This yields a running
time of c3 + c4σ(t) for one iteration. Therefore, the total running time for the algorithm
is O(N + m log m + Nm). Since m is usually fixed to some low value such as 5 or 6, the whole expression reduces to O(N).
4.3 HER for Contours
We can now apply the proposed model to signals in order to analyze and classify closed
contours of objects and regions of a pictorial scene.
We first need to obtain a 1-d representation of a 2-d contour. In order to obtain a
1-d time series from 2-d contour data, two different approaches come to mind. The first
approach is that of sampling the contour at fixed angle increments, recording the distance
between the contour and the mass center. This approach has the advantage of keeping
the index small because, once the angle increment ε is fixed, the number of sample points
is 2π/ε for any contour. However, this approach has at least one serious shortcoming:
for non-convex contours such as the one depicted in Fig. 4.3 (A), there are several angles
where there is no single contour point: for the angle θ, there are 3 points that intercept a straight line starting from the center O. As a consequence, at least 2 out of these 3 points cannot be represented, and the reconstruction is inevitably wrong (B). The exact type of error depends on the strategy adopted in the case of multiple points for the same angle: do we record the minimum, maximum or average distance?
The second approach involves scanning the contour pixel by pixel rather than angle
by angle. One conceivable disadvantage is that our representation will have one point for
each contour pixel; therefore, the data size can get large for images at high resolution.
Figure 4.3: Sampling a 2-D contour at fixed angle increments can destroy information
However, the advantage is that any contour can be represented in a lossless, reversible
way. For this reason, this approach has been preferred. We scan the contour clockwise
starting from its top left pixel, recording the distance between each pixel and the center
of mass, as shown in Fig. 4.4. The contour (A) is sampled pixel by pixel and this yields a
periodic time series (B) with as many points as there are pixels in the object contour.
To do so, we choose our frame of reference to be a coordinate system centered in the
center G of the object under exam, computed as follows:
(xG, yG) ≡ ((1/k) ∑_{i=1}^{k} xi, (1/k) ∑_{i=1}^{k} yi), (4.13)
where xi and yi are the coordinates of a pixel Pi belonging to the contour, while k is the
number of contour pixels.
Next, we compute the d4 distance between the center G and the uppermost and left-
most pixel P1 of the contour:
d4(P1, G) = |x1 − xG|+ |y1 − yG|. (4.14)
Repeated application of Eq. (4.14) for all contour pixels according to a predefined direction
(e.g., clockwise) yields a representation γ(s) of the contour in curvilinear coordinates. Such a representation is unambiguous, since it is possible to reconstruct the original 2-D contour shape from it without loss of information.
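Under the conventions just described, obtaining γ(s) from an ordered contour can be sketched as follows (illustrative; the function name and the (x, y) tuple representation are our own):

```python
def contour_signal(contour):
    """Map an ordered list of contour pixels (x, y) to the 1-d signal gamma(s).

    Each sample is the d4 (city-block) distance between a contour pixel and
    the center of mass G of the contour, as in Eqs. (4.13)-(4.14).
    """
    k = len(contour)
    xG = sum(x for x, _ in contour) / k                  # Eq. (4.13)
    yG = sum(y for _, y in contour) / k
    return [abs(x - xG) + abs(y - yG) for x, y in contour]   # Eq. (4.14)
```

For a square contour scanned clockwise, the maxima of the resulting signal fall on the four corners, the points farthest (in d4 distance) from the center.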
If we apply the model we have just described to γ(s), we observe that the maxima
of γ(s) correspond to the points of the contour having the greatest distance from the
Figure 4.4: Sampling a 2-D contour pixel-by-pixel
center G. Figure 4.4 depicts the concept graphically in the case of a theoretical con-
tour, while Figure 4.5 shows the correspondence between a real-life contour and its her
representation.
The entropy associated with each maximum in γ(s) can be interpreted as the 'signature' of the distribution of contour pixels in the neighborhood of the considered maximum: if r(P1, P2) denotes the straight line passing through the points P1 and P2, then the area enclosed by the curve γ(si − σi, si + σi) ∪ r(si − σi, si + σi) is the quantity that gets
The discrete form of the Fourier Transform [34] is also often used as a shape descriptor. It has several well-known mathematical properties, most importantly linearity. Therefore, by invoking Parseval's theorem, it can be proven that searching with a Fourier index, however reduced, can produce no false dismissals. In other words, images that lie within the specified distance from the query image will never fail to appear in the answer set. The reason is that, since many Fourier coefficients are discarded, the distance between two items in feature space is never greater than the original distance in pixel space. This indeed makes sure that there are no false dismissals, but on the other hand it might introduce some false alarms that must be filtered out in a postprocessing step.
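This lower-bounding behavior is easy to check numerically: with an orthonormal DFT, the distance computed on any truncated set of coefficients never exceeds the distance between the full signals. A small sketch (our own, not from the thesis; the example vectors are arbitrary):

```python
import cmath
import math

def dft(x):
    """Orthonormal discrete Fourier transform (1/sqrt(N) normalization)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) / math.sqrt(N)
            for k in range(N)]

def dist(a, b):
    """Euclidean distance between two (possibly complex) vectors."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
y = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]
X, Y = dft(x), dft(y)

full = dist(x, y)                                    # distance in pixel space
for alpha in range(1, len(x) + 1):
    trunc = dist(X[:alpha], Y[:alpha])               # keep first alpha coefficients
    assert trunc <= full + 1e-9                      # hence no false dismissals
```

Since the orthonormal DFT preserves Euclidean norms (Parseval), the truncated feature-space distance is always a lower bound on the pixel-space distance.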
Although the dft and the closely related dct (Discrete Cosine Transform), used by
jpeg, are able to capture a good deal of information about images, sharply straight lines
Figure 4.5: The correspondence between a contour and its her representation
Figure 4.6: Approximation of a shape by the first M coefficients of its dft (M = 1, 4, 10, 60)
Figure 4.7: Approximation of a shape by its M largest maxima using her (M = 3, 6, 8, 10)
cannot be represented effectively unless we are willing to use enough coefficients. As shown by Zahn and Roskies [49], an adequate approximation of a polygon requires 15–30 coefficients. When the object has highly irregular or jagged contours, even 30 coefficients are not enough to characterize the shape adequately for accurate reconstruction. The increasingly good approximation of a shape by its dft is shown graphically in Figure 4.6.
Unlike Fourier-based methods, the her representation was never meant for reconstructing the signal. However, it is indeed possible to reconstruct the contour if one feature is added to the her representation: the angle made by the current maximum and some reference line, say, the positive X axis. This feature might be employed to enhance the system's usability by providing the user with feedback about the actual appearance of the query shape, at the cost of a 33% increase in index size. In this case, Figure 4.7 shows how a her reconstruction changes when increasing the number M of maxima.
In Figure 4.7, it is assumed that all interpolation between maxima is done by straight line segments. In principle it is possible to use curves to fit the position i where each maximum x(i) occurs, but in practice the final effect is usually not worth the extra effort.
4.3.1 Properties of HER
her for shape contours has several nice invariance properties. In particular, we list the
most important below.
• Translation invariance. her is obviously invariant to any translation of the object in
the image space, since the reference system gets translated together with the center
of the object.
• Rotation invariance. her is invariant to object rotation by any integer multiple of π/2. Indeed, in the case of continuous signals, it is invariant for any angle: a rotation of the object only causes a change of phase in γ(s) and therefore in the her representation.
• Reflection invariance. Mirror reflection is equivalent to a change in the direction
of the contour scan used to build the 1-D signal. However, the maxima get or-
dered according to their magnitude, not their order of occurrence in the scan; as a
consequence, there’s no change in the vector of ordered maxima.
• Scaling invariance. Suppose that an object A has two adjacent maxima m1 and m2 in the curvilinear coordinate representation of its contour C. Let d be the distance between m1 and m2. Now perform a scaling transformation on A obtaining A′ (and m′1, m′2, C′ and d′). Then, there is the following relation between the lengths of the contours |C| and |C′| in the original object and its scaled version:

|C| / |C′| = d / d′. (4.15)

If we represent C and C′ in curvilinear coordinates, then the following is also true:

|γ| / |γ′| = d / d′, (4.16)

where d and d′ represent the distance between any two maxima in A and A′ respectively. Summing up, if we want the system to be fully invariant to scaling, all we have to do is substitute the test for 'approximate equality' between y1 and y2 with a test for 'approximate linear dependence.'
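The thesis does not fix a specific 'approximate linear dependence' test; one natural sketch uses the cosine of the angle between the two feature vectors (the function name and the tolerance are our own assumptions):

```python
import math

def approx_linear_dependent(y1, y2, tol=1e-3):
    """True when y1 and y2 are (nearly) scalar multiples of each other."""
    dot = sum(a * b for a, b in zip(y1, y2))
    n1 = math.sqrt(sum(a * a for a in y1))
    n2 = math.sqrt(sum(b * b for b in y2))
    if n1 == 0 or n2 == 0:
        return n1 == n2                       # only the zero vector matches itself
    return 1.0 - abs(dot) / (n1 * n2) < tol   # |cos(angle)| close to 1
```

Under this test, the her vector of a shape and that of its uniformly scaled version, being approximately proportional by Eq. (4.16), are accepted as a match, while unrelated vectors are not.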
Table 4.1: Invariance properties of her for contours
Contrast Scaling YES
Luminance Shifting YES
Rotation YES
Reflection YES
Translation YES
Zoom YES*
These invariance properties are summarized in Table 4.1. The asterisk in the Zoom
invariance row means ‘yes, if we change the test to include (approximate) linear depen-
dence.’
A few words about the 'approximate equality' test are in order. Techniques for approximate matching based on feature extraction usually adopt the distance in feature space as the main indicator of similarity.
Let T be a mapping from image space I to feature space F. T is said to be a complete mapping iff any element XI ∈ I has exactly one image XF = T(XI) ∈ F. Non-complete mappings may still be usable for pattern recognition tasks if they meet the weaker (and more fuzzily defined) condition of separating clusters in feature space. Let XI and YI be two objects in I and let XF = T(XI) and YF = T(YI) be their images in feature space; let D(·, ·) and G(·, ·) be distance metrics in I and F respectively.

When the transformation T represents, for instance, a Fourier transform with truncation of all but the first α harmonics, Parseval's theorem tells us that

Gα(XF, YF) ≤ D(XI, YI) (4.17)

lim_{α→N−1} Gα(XF, YF) = D(XI, YI), (4.18)
where N is the number of points in the input signals—in this context, supposing we are
dealing with contour signals, the number of pixels in the contours of XI and YI .
When T denotes the her representation, we have a similar result:
Gα(XF, YF) ≤ D(XI, YI) (4.19)

lim_{α→N} Gα(XF, YF) = D(XI, YI), (4.20)
Figure 4.8: Converting a 2-d texture into a 1-d time series. (A): what the texture looks
like; (B): the partition element (texture tile); (C): the spiral; (D): the resulting 1-d signal.
only that in this case α represents the number of maxima utilized in the her representation
rather than the number of Fourier harmonics.
The value of α that provides the best tradeoff between ease of computation and effectiveness of the results depends on the specific shapes of the objects we are dealing with. This is true for both her and Fourier-transform-based methods.
4.4 HER for Textures
How do we generate a time series—that is, a sequence—from intrinsically 2-d texture
data? Our solution is that of following a spiral path in the texture element, as shown in
Fig. 4.8.
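An outside-in spiral scan of a tile can be sketched as follows (illustrative; the thesis does not spell out the start corner or direction, so a clockwise scan starting at the top-left pixel is our assumption):

```python
def spiral_scan(tile):
    """Flatten a 2-d texture tile (list of rows) into a 1-d sequence,
    following an outside-in clockwise spiral from the top-left pixel."""
    rows = [list(r) for r in tile]        # work on a mutable copy
    out = []
    while rows:
        out.extend(rows.pop(0))           # top row, left to right
        if rows and rows[0]:
            for r in rows:
                out.append(r.pop())       # right column, top to bottom
        if rows:
            out.extend(reversed(rows.pop()))   # bottom row, right to left
        if rows and rows[0]:
            for r in reversed(rows):
                out.append(r.pop(0))      # left column, bottom to top
    return out
```

For the 3 × 3 element of Fig. 4.9(a) this yields the sequence 1 2 3 6 9 8 7 4 5.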
4.4.1 Invariance Properties
Once the index data is extracted as shown above, it is organized into a spatial data access
structure for subsequent searching. The structure of choice in this case is a k-d-tree.
As stated earlier, this method is invariant to some types of image transformations.
Here are a few more details about the existing invariances.
Figure 4.9: Rotating the partition element yields different local maxima. The spiral scan of the original 3 × 3 element (a) reads 1 2 3 6 9* 8 7 4 5* - 1 …, while the scan of its rotated version (b) reads 7* 4 1 2 3 6 9* 8 5 - 7 … (asterisks mark local maxima in the 1-d sequence).
Let us consider pixel transformations first. Contrast scaling—that is, multiplying all
image pixel values by some constant—does not change the relative energy of any pixel. For
this reason, none of the data generated by the algorithm undergoes any change. Similarly,
luminance shifting—that is, summing a constant to all the pixel values—does not change
the relative order of the maxima and therefore does not modify the index.
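Both invariances can be checked directly: contrast scaling (with a positive factor) and luminance shifting are monotone pixel maps, so they preserve the positions and the ranking of the local maxima. A small sketch, with assumed helper names of our own:

```python
def ranked_maxima(x):
    """Positions of interior local maxima, sorted by decreasing value."""
    idx = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]
    return sorted(idx, key=lambda i: x[i], reverse=True)

x = [10, 40, 20, 80, 30, 60, 10]
a, b = 1.7, 25                             # contrast scale (> 0), luminance shift
y = [a * v + b for v in x]                 # monotone pixel transform
assert ranked_maxima(x) == ranked_maxima(y)   # same maxima, same ranking
```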
As for geometric transformations, translation is not even ‘seen’ by our method, since
it only deals with data coming from a segmentation phase. Image rotation/reflection and
zooming, however, do change the partition element (texture tile) and in general yield
different index data. Consider the theoretical partition element shown in Fig. 4.9 as an
example. The local maxima in the sequence are marked by asterisks. As a clockwise
rotation of π/2 is applied, the maxima reported by the spiral method change. However, it
can be seen that ‘real’ 2-d maxima (i.e., the value 9) are always reported; what happens for
different rotated versions of the element is that ‘spurious’ 1-d maxima appear. A similar
line of reasoning comes into play with mirror reflections and zoom.
Summing up, it is possible to make her for textures invariant to both rotations and
zoom if the maxima are located with a true 2-d algorithm and if the computation of
relative energy (Step H3 in the algorithm) is done with a 2-d, rather than 1-d, Gaussian.
Table 4.2 summarizes the invariance properties for texture-based retrieval. The aster-
isks denote issues that the method can be made invariant to, provided some caution is
exercised (e.g., the 4-distance in Step H2 should be normalized with respect to the size of
the element). Even as it is now, however, the method appears to be robust with respect
to transformations—such as rotation—for which there is no theoretical (that is, absolute)
Table 4.2: Invariance properties of her for textures
Contrast Scaling YES
Luminance Shifting YES
Rotation NO*
Reflection NO*
Translation YES
Zoom NO*
invariance. This is illustrated in more detail in Section 4.4.2, which deals with the results
obtained by experimenting with the method.
4.4.2 Experimental Results
Several experiments have been performed in order to assess the validity of her for textures.
For these tests, the main database used was the Brodatz set of textures [6].
The original Brodatz dataset included 167 textures, but we added several transformed
versions in order to test the robustness of retrieval. Furthermore, we have scaled the textures down to 32 × 32 pixels from the original 128 × 128, and kept only the luminance channel, thus obtaining 8-bit images. There are several reasons for this choice, the main one being that in most practical applications the texture element is usually limited to sizes in the range of 16 × 16 to 64 × 64.
The experiments were mainly aimed at assessing the robustness to pixel transformations and geometric transformations. For this reason, the database was augmented with
variations of the original textures including the following:
• Luminance-shifted versions (made brighter and darker by different amounts);
• Color-reduced versions, where the colors were reduced in number from the original 256 to 16;
• JPEG encoded and decoded versions (average compression factor ranging from 1:20 to 1:30);
• Contrast-scaled versions, with different amounts of scaling;
• Noisy versions, with different amounts of Gaussian noise added—10%, 20% or 50% of
the total dynamic range (that is, since we are dealing with 8-bit images, average 25.5,
average 50 and average 128);
• Mirror-reflected versions;
• Rotated versions (only integer multiples of π/2 were considered).
A part of the augmented database can be seen in Fig. 4.10. As an example, Element
#1 is Bark.0000 and its transformed versions are in positions 3–16 inclusive; Element #2
is Metal.0000 and its transformed versions are in positions 172–185.
In the first set of experiments, we used one of the original textures as the query and
looked at the returned results to see which transformed versions came up highest (closest)
in the answer set. In all cases, there were several matches at distance 0 in feature space
from the query texture. In particular, color reduction, contrast scaling and luminance
shifting do not change the value of the index and therefore come out at distance 0. Other
transformations do change the index, but the change is small enough for the transformed
image to be returned in the top positions of the answer set, just after the images at
distance 0.
The results of a sample query are depicted in Fig. 4.11. As can be seen, the first
8 matches consist entirely of transformed versions of the query image. The following
8 matches (that is, those in the second row) contain several spurious elements, including
Fabric.0005 and some of its transformed versions. Similar results were obtained for all
the 15 query tiles that were used in our full-database retrieval experiments.
Another example is shown in Fig. 4.12. In this case the query image is Bark.0000 ;
as before, the first row consists entirely of transformations of the query image. Some
spurious results such as Food.0000 and Fabric.0000 appear in the second and third rows,
along with other representatives of the Bark family.
A nice side effect of any query is that of partitioning the database into bins (clusters)
which are plainly visible by graphing the distances. As an example, Fig. 4.13 shows the
250 smallest distances from the images in the database to the query image—in this case,
Fabric.0001 (Element #36 in Fig. 4.10). The distances are sorted in increasing order.
The apparent plateaus or quasi-plateaus in the graph point to clusters of similar (or
index-identical) images in the database.
Figure 4.10: A selection of 256 tiles from the complete texture database utilized for the
experiments
Figure 4.11: Results of a sample query: Metal.0000 (Element #2 in Fig. 4.10)
Figure 4.12: Results of a sample query: Bark.0000 (Element #1 in Fig. 4.10)
Figure 4.13: Distances from Fabric.0001 (#36 in Fig. 4.10) to the closest 250 matches
in the database (x-axis: rank; y-axis: distance)
Another set of experiments (‘focus experiments’) was performed on restricted versions
of the database, containing the full set of original textures and only the transformations
of a single tile. In these cases the query tile was one of the transformed versions. We
found that in most cases the original and all transformed versions of the query image were
retrieved in the very first positions of the answer set, the one exception being the versions
with added Gaussian noise. This was to be expected, since the addition of substantial
amounts of noise tends to distort the textures significantly, especially considering the
small size of the tiles (32 × 32 pixels).
Table 4.3 summarizes the results obtained using 10 query images. For each query,
we fixed the parameter σ at several different values; furthermore, we fixed the “residual
energy” termination parameter (the threshold for alternate termination in Step H6 of the
algorithm) at different values, too. Each original image in the Brodatz dataset has several
transformed versions present in our database; the average number of such variations is 16.
Let us define a 'relevant match' for a given query image Q as another image, present in the database, which can be obtained from Q by applying one of the above-mentioned transformations. For instance, if the query image is a rotated version of Bark.0000, then the original Bark.0000 is a relevant match, as is any differently rotated version.
The column labeled ‘match/12’ in Table 4.3 reports how many relevant matches are
found in the first 12 elements of the answer set. It should be noted that such ‘closest’
relevant matches are always consecutive, with no false alarm in between. Given the in-
variance and robustness properties described above, this is to be expected. The columns
‘match/24’ and ‘match/36’ give similar information for larger answer sets. In this case,
there are intervening false alarms.
As a last remark about Table 4.3, the column labeled σ reports the value at which σ was fixed. This has a direct impact on the number of maxima found by the
algorithm. In effect, this parameter is a normalized quantity, between 0 and 1, proportional
to the minimum spacing between maxima, in the sense that two local maxima falling less
than σ ·N pixels apart will count as a single one.
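The role of σ as a minimum spacing can be sketched as follows (our own greedy formulation: maxima are scanned in decreasing value order, and any maximum closer than σ·N samples to one already kept is discarded):

```python
def thin_maxima(positions, values, sigma, N):
    """Keep only maxima at least sigma*N samples apart, preferring larger ones.

    positions: sample indices of the local maxima
    values:    corresponding signal values
    sigma:     normalized minimum-spacing parameter in [0, 1]
    N:         signal length in samples
    """
    order = sorted(range(len(positions)),
                   key=lambda i: values[i], reverse=True)
    kept = []
    for i in order:
        if all(abs(positions[i] - positions[j]) >= sigma * N for j in kept):
            kept.append(i)                   # far enough from all kept maxima
    return sorted(positions[i] for i in kept)
```

With σ = 0.1 and N = 100, two maxima 2 samples apart collapse into the larger one; with a small enough σ, all maxima survive.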
When all the maxima in the signal are significant, even if very close to one another, then
small values of σ are better suited; on the contrary, when not all maxima are significant,
higher values of σ can help to reduce the influence of noise. In the case of contours,
all maxima are usually important, since a maximum generally represents a maximum
E% σ match/12 match/24 match/36
5 .1 9 12 13
10 .1 8 8 8
15 .1 8 8 13
20 .1 8 9 13
25 .1 8 13 13
30 .1 9 12 13
35 .1 8 8 9
5 .15 10 10 10
10 .15 10 11 11
15 .15 11 12 12
20 .15 9 10 10
25 .15 9 9 9
30 .15 8 8 9
35 .15 9 11 14
5 .2 9 12 13
10 .2 9 12 12
15 .2 8 12 13
20 .2 9 9 12
25 .2 8 8 10
30 .2 12 12 12
35 .2 12 12 12
5 .25 8 8 8
10 .25 10 12 12
15 .25 9 9 11
20 .25 10 11 12
25 .25 9 11 11
30 .25 10 11 12
35 .25 11 11 11
Table 4.3: Tabular results of texture-based retrieval performed on the extended Brodatz
data set
elongation point of the contour, that is, a vertex of the object [15]. In the case of textures,
as illustrated in Table 4.3, going from small to large values of σ yields the expected decrease
in the number of maxima, but does not affect the quality of the answer set in an apparent
way.
We made several experiments using specific cut-down versions of the database. For
each experiment, the databases contained 168 elements: the original 167 Brodatz tiles,
plus a single transformed version of one of them. In each case, querying the system
with the transformed version yields the original (or vice versa). The only exception is
with Gaussian noise addition: this particular transformation has a strong effect on the
resulting texture and irreversibly destroys information, even more so since we are dealing
with small texture tiles (32 × 32). In the case of Gaussian noise addition of 10% and 20% intensity, the average rank of the relevant match was a little more than 4 overall: about 3 for 10% noise, about 5 for 20% noise.
Another remark that can be made is that increasing the processed fraction of signal
energy—and then the number of maxima, i.e., of features—does not significantly improve
the quality of the retrieval. Indeed, in some cases, the number of relevant matches de-
creases. Looking at Figure 4.14, the near-linear relation between signal energy and number
of maxima is apparent. However, the 4 graphs in Figure 4.15 show that a larger number
of maxima has no evident relation to the quality of the answer set. This is due to the
fact that increasing the energy introduces into the index some lesser maxima, which do not characterize the element and therefore give rise to additional false alarms. In other words, it is not the number of features, but their relevance, that makes the difference.
4.4.3 Comparison with a Wavelet Based Method
There are several methods available for image retrieval, and the methods based on the
multiresolution formulation of wavelet transforms are among the most reliable and ro-
bust [44, 45]. The scientific literature features many techniques based on wavelets (for a
sample, see [27,47,5]), and for this comparison we picked one method that shows interest-
ing characteristics and good performance. Its name is hs (Hierarchical Signature) [2].
The comparison was aimed at assessing the efficiency and effectiveness of the retrieval.
In particular, efficiency is related to the computational requirements and to the index size,
while the effectiveness has to do with the quality of the answer set. Methods based on
Figure 4.14: Relation of the energy fraction used to the number of maxima found (index
size); one curve per value of σ = 0.10, 0.15, 0.20, 0.25 (x-axis: energy %; y-axis: number of maxima)
Figure 4.15: A graphical view of the outcome of texture-based retrieval performed on
the extended Brodatz data set; one panel per value of σ = 0.10, 0.15, 0.20, 0.25 (x-axis: number of maxima; y-axis: relevant matches)
the wavelet transform require a time of at least O(N logN), where N is the number of
pixels in the image. Most methods, including hs, go as far as O(N2), due to the additional
processing involved in index construction.
In the case of such multiresolution approaches, the size of the index data for a single
image depends heavily on the so-called ‘detail level’ where the match is performed. Usually,
the match is performed on all levels, in a hierarchical fashion or following some other
scheme. hs, however, chooses an ‘optimal level’. Usually, it is the deepest level, that is,
the biggest, but which level is optimal does depend on the image and the whole database,
so that it is not easy to decide a priori. Therefore, the usual course of action for hs is
having the index made up with data from all the levels, then choosing the optimal one for
matching at run time only. This has the disadvantage of yielding a bigger index, but at
the same time in principle it allows lossless compression and decompression of the image
to be integrated with the database management system, which is often a desirable feature.
As for the quality of the retrieval, wavelet-based approaches are very robust and tolerate even the addition of Gaussian noise to the query texture without overly negative consequences. As an example, the average rank of Gaussian-perturbed textures in 'focus experiments' is 2 (it is 4 for heat). With regard to other transformations, the results are very similar to those of heat. In the usual working conditions, as few as 4 or 5 maxima are usually
enough to characterize a texture in an effective way. Indeed, having too many maxima in
the index does not improve on the performance, as shown in Fig. 4.15. As a consequence,
the typical size of heat indices is rather small.
On the other hand, a typical wavelet-based index requires about a hundred coefficients
to work with good accuracy. Summing up, heat's performance in terms of quality is very close to that of methods based on the wavelet transform, but it is much less costly in terms
of computing resources and index size. Additionally, as stated above, this representation
can be effectively used for different kinds of data; in particular, it was originally designed to
work with object contours! Table 4.4 summarizes the different strengths of heat and hs.
Summing up, it might be suggested that hs would be a more appropriate choice whenever we have to deal with JPEG images, when it is desirable to have a database of compressed images (in which case the compression could be integrated into the DBMS paradigm), or when the pattern of induced noise is Gaussian-like (transmission errors). This suggests the retrieval of images from the Internet as a very likely scenario.
Table 4.4: Quick comparison between hs and heat

Feature                    hs    heat
Efficiency
  Time                           √
  Space                          √
Effectiveness
  Luminance shift                √
  Contrast scaling               √
  JPEG                     √
  Gaussian noise           √
  Color reduction                √
Interoperability
  Integrated compression   √
  Integrated contour             √
On the other hand, heat works with small index sizes and is very quick in both index
construction and matching. It is invariant to transformations such as luminance shifting
and contrast scaling, and it is very easily combined with other matches based on e.g.,
contour data. All this suggests a DBMS dealing with medical images as an ideal field of
application: the majority of relevant objects in medical images are characterized by their
texture and their shape (contour); furthermore, luminance shifting and contrast scaling
are typical disturbances that are likely to be introduced by medical imaging systems such
as tomography.
4.5 Experimental Results
In order to test the performance of heri, we implemented it on a 233MHz Pentium-II
system using Matlab under Windows 98. Our experiments have been performed on a
number of databases of different size. As a general result, it appears that five maxima are
generally enough to capture a significant amount of signal energy for the signal sizes we
dealt with. However, it must be said that the percentage of energy needed to describe the
contours with adequate precision is strongly dependent on the type of signal examined
and the type of application.
Before showing the detailed results, we briefly review the evaluation criteria usually
adopted for the testing of retrieval systems. The recall measures the system’s ability to
retrieve all relevant objects, while the precision measures the system's ability to retrieve only relevant objects. Another indicator that is often used is the normalized recall (NR) [38], which is defined as follows.
Suppose we have a database D of |D| objects, where the number of objects relevant to our query is N < |D|. Furthermore, suppose that the relevant objects are sorted a priori so that the most relevant object is X1, down to the least relevant object XN. We perform a query that returns an ordered answer set A. Let ri be the rank of Xi in the answer set A.
The ideal rank (IR) is then defined as

IR = (1/N) ∑_{i=1}^{N} i = (N + 1)/2 (4.21)

(note that it does not depend on A); the average rank (AR) of A is

AR = (1/N) ∑_{i=1}^{N} ri. (4.22)

The difference AR − IR gives a measure of the effectiveness of the system. It is usually normalized and complemented in order to obtain a value between 0 and 1, with 1 as the ideal score, known as normalized recall (NR):

NR = 1 − (AR − IR)/(|D| − N). (4.23)
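Eqs. (4.21)–(4.23) translate directly into code. This is a sketch; `ranks` holds the ranks r_i of the relevant objects in the answer set, and we use Salton's complemented form 1 − (AR − IR)/(|D| − N), so that 1 is a perfect score, consistent with Table 4.5 where larger NR values are better:

```python
def normalized_recall(ranks, db_size):
    """Normalized recall from the ranks of the N relevant objects."""
    N = len(ranks)
    IR = (N + 1) / 2                       # ideal rank, Eq. (4.21)
    AR = sum(ranks) / N                    # average rank, Eq. (4.22)
    return 1 - (AR - IR) / (db_size - N)   # Eq. (4.23), complemented form
```

For instance, if the 3 relevant objects of a 10-object database come back in the first 3 positions, NR is 1; if they come back last, NR is 0.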
We have evaluated heri’s performance with the use of 10 heterogeneous query time series
selected from a database of 1500. For each of the 10 query time series, we manually ranked
the 20 most similar objects in the database in order to compute the NR. The number of
maxima utilized for our tests is 10. We also performed the same queries using Euclidean distance as a similarity measure. As can be seen in Table 4.5, heri outperforms ed even if the former utilizes only 10 coefficients against the latter's 15.
A visual comparison between heri and ed (see resp. Figures 4.16 and 4.17) shows the
effectiveness of heri, which is able to capture the ‘thin-shaped fish’ quality in the answer
set. ed, on the contrary, returns a few relevant objects, but also a false alarm (a rabbit).
Moreover, Figure 4.18 shows the results obtained with a database where rotated versions
of the query had been added. heri’s ability in retrieving rotated versions of the query
Table 4.5: Comparison between heri, Euclidean distance (ed) and a moment-based
technique (mbt) in terms of normalized recall
Size of DB: 1500; number of queries: 10.
Normalized recall: heri 0.984, ed 0.971, mbt 0.960
is apparent. The rotated replicas of the query appear first, since they are at distance 0 from the query; this is a consequence of her's invariance to rotations by integer multiples of π/2, discussed in Section 4.3.1.
In order to evaluate heri's efficiency, we have also performed comparative tests against the moment-based technique [30, 36], considering the first two moments and their ratio. As shown in Table 4.5 and Figure 4.19, this is the worst-performing technique, exhibiting many false dismissals.
Table 4.6 shows the results of a different set of queries over a slightly larger database (1600 objects). For this experiment, the answer set size was 70 throughout. The data in the table are to be read as follows. '% Enrg.' is the fraction of the total signal energy utilized for constructing the index; the value 10 for heri means that the threshold in Figure 4.1, Step H6 has been set to 90%. The 'Avg. dist.' column contains the average distance between all database objects in feature space. In image space, the average distance is 5399.5, as can be deduced from the 'ed' row (ed is calculated on the unmodified contours). The 'Avg. A.S. dist.' column shows the average inter-object distance (in feature space) inside the answer set only; this quantity gives some insight into the average cluster size. Finally, the 'NR' column shows the normalized recall obtained.
As can be seen, heri achieves good results while using only a small fraction of the total energy present in the signal. The energy utilized is roughly related to the number of actual signal samples used in the index. What this ultimately means is that the whole index is much lighter, since it contains less information. We have found that in many cases, 50% (or even 10%!) of the signal energy is enough to achieve good retrieval results. Euclidean distance, on the other hand, is calculated by taking into account all of the signal's samples.
Figure 4.16: An example of retrieval using heri

Table 4.6: Detailed comparative results for Euclidean distance (ed), heri and the moment-based technique (mbt)

Technique    % Enrg.    Avg. dist.    Avg. A.S. dist.    NR
ed             100        5399.50         15.86          0.74
heri            10         611.07          7.73          0.84
mbt              —        5960.43         16.02          0.62

Figure 4.17: An example of retrieval using Euclidean distance

Figure 4.18: An example of heri's ability to retrieve rotated versions of the query

Figure 4.19: An example of retrieval utilizing a moment-based technique

However, the optimal threshold to be used for the representation, and therefore the actual amount of signal energy used, is strongly dependent on the data contained in the database.
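The effect of the energy threshold can be sketched as follows: retain the highest-energy signal samples until the chosen fraction of the total energy is covered. This is only an illustrative stand-in for Step H6; the function name and the toy signal are ours:

```python
def select_by_energy(samples, threshold=0.90):
    """Keep the highest-energy samples until `threshold` of the
    total signal energy is covered; return their indices."""
    energies = [(x * x, i) for i, x in enumerate(samples)]
    total = sum(e for e, _ in energies)
    kept, accumulated = [], 0.0
    for energy, index in sorted(energies, reverse=True):
        if accumulated >= threshold * total:
            break
        kept.append(index)
        accumulated += energy
    return sorted(kept)

# A few large maxima dominate the energy, so the index stays small.
signal = [0.1, 9.0, 0.2, 8.5, 0.1, 0.3, 7.9, 0.2]
print(select_by_energy(signal))   # -> [1, 3, 6]
```

With this toy signal, three samples already carry more than 90% of the energy, which mirrors the observation above that a small energy fraction suffices.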
Another interesting fact emerging from the table is that the moment-based mapping of image space into feature space is really an expansion rather than a contraction: the average distance in feature space is greater than in image space.
Finally, in order to attain a more efficient search, we utilized a spatial access structure. Spatial access methods are techniques that allow faster access to spatially organized data [3, 26, 29, 31, 39, 46]. heri utilizes k-d trees [4] as its spatial access method.
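As a rough illustration of the access method, a minimal k-d tree with nearest-neighbor search might look like the following. This is a toy sketch, not heri's actual implementation:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree; each node is (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return the tree point closest to `query` (Euclidean distance)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, query) < math.dist(best, query):
        best = point
    axis = depth % len(query)
    near, far = (left, right) if query[axis] < point[axis] else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only descend into the far subtree if the splitting plane is closer
    # than the best match found so far.
    if abs(query[axis] - point[axis]) < math.dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))   # -> (8, 1)
```

The pruning test on the splitting plane is what lets the search skip most of the database instead of scanning it linearly.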
4.6 Concluding Remarks
This chapter presented heri, a novel technique for content-based retrieval of time series. heri is based on a Hierarchical Entropy-based Representation (her) for 1-dimensional signals that utilizes the local signal maxima as well as the entropy associated with each of them. The experiments show that this representation is both effective and very general. In fact, in this chapter the time series were actually curvilinear representations of the contours of shapes. More generally, this model can be profitably employed whenever it is possible to obtain a 1-D representation of the patterns to be retrieved by content.
References
[1] R. Agrawal, C. Faloutsos, A. Swami. “Efficient similarity search in sequence
databases.” Proc. Foundations of Data Organization and Algorithms (FODO)
Evanston, IL, Oct. 1993. 14
[2] M. G. Albanesi, M. Ferretti, A. Giancane. “Robust hierarchical indexing based on
texture features.” Journal of Visual Languages and Computing 11, pp. 383–404, 2000.
52
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger. "The R∗-tree: an efficient and robust access method for points and rectangles." Proc. ACM SIGMOD, pp. 322–331, May 1990. 10, 63
[4] J. L. Bentley, “Multidimensional binary search trees used for associative searching,”
Comm. ACM, Vol. 18, No. 9, pp. 509–517, Sept. 1975. 10, 63
[5] C. Brambilla, A. D. Ventura, I. Gagliardi, R. Schettini. “Multiresolution wavelet
transform and supervised learning for content-based image retrieval.” IEEE Int’l Con-
ference on Multimedia Computing and Systems, Vol. 1, 1999, pp. 183–188. 52
[6] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover Publications, New York, 1966. Available (128 × 128) in a single tar file: ftp://ftp.cps.msu.edu/pub/prip/textures/ 45
[7] S. K. Chang, Q. Y. Shi, C. W. Yan. “Iconic indexing by 2D-strings.” IEEE Trans.
Pattern Analysis Mach. Intell., 9(3), pp. 413–427, 1987. 9
[8] Y. C. Chang, B. K. Shyu, S. J. Wang, “Region-based fractal image compression with
quadtree segmentation”, Proc. ICASSP’97, Munich, 1997. 4
[9] A. Del Bimbo, M. Campanai, P. Nesi. “A 3-dimensional iconic environment for image
database querying.” IEEE Trans. Soft. Eng. 19(10), pp. 997–1011, March 1993. 9
[10] A. Del Bimbo, P. Pala. Visual image retrieval by elastic matching of user sketches.
IEEE Trans. Pattern Analysis Mach. Intell., 19(2), Feb. 1997. 9
[11] A. Del Bimbo, M. De Marsico, S. Levialdi, G. Peritore, "Query by dialog: an interactive approach to pictorial querying," Image and Vision Computing 16, pp. 557–569, Elsevier, 1998.
[12] M. De Marsico, L. Cinque, S. Levialdi. "Indexing pictorial documents by their content: A survey of current techniques." Image and Vision Computing, Vol. 15, pp. 119–141, 1997. 9
[13] R. Distasi, M. Nappi, S. Vitulano. "Speeding up fractal encoding of images using a block indexing technique." Proc. ICIAP'97, Lecture Notes in Computer Science vol. 1311, pp. 101–107, Springer-Verlag, Florence, Sep. 1997.
[14] R. Distasi, M. Polvere, M. Nappi. “Split-decision functions in fractal image coding.”
IEE Electronics Letters 34(8), pp. 751–753, Apr. 1998. 12
[15] R. Distasi, D. Vitulano, S. Vitulano, “A hierarchical representation for content based
image retrieval,” Journal of Visual Languages and Computing, Special Issue on Mul-
timedia Databases and Image Communication, Vol. 5, n. 8, Aug. 2000. 30, 52
[16] R. Distasi, M. Nappi, M. Tucci, S. Vitulano. “Image indexing by contour analysis: a
comparison” Proc. IVWF4, Ischia, 2001, IEEE.
[17] R. Distasi, S. Vitulano. “Robust image retrieval based on texture information” Proc.
MDIC2001, Lecture Notes in Computer Science vol. 2184, Springer-Verlag, Amalfi,
2001.
[18] R. Distasi, M. Nappi, M. Tucci, S. Vitulano. “ConText: A technique for image re-
trieval integrating CONtour and TEXTure information” Proc. ICIAP2001, Palermo,
2001, IEEE.
[19] R. Distasi, S. Vitulano. “A Hierarchical Entropy Based Representation for Medical
Signals”, in V. Cantoni et al., eds., Human and Machine Perception 3: Thinking,
Deciding and Acting, Kluwer/Plenum Academic Press, New York, 2001. 30
[20] C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, D. Petkovic, R. Barber. Efficient
and effective querying by image content. Journal of Intelligent Inf. Systems, 3(3/4),
p. 231–262, July 1994. 9
[21] Y. Fisher, Fractal Image Compression—Theory and Application, Springer–Verlag,
New York, 1994. 3, 9
[22] M. Flickner et al. "Query by image and video content: the QBIC system." IEEE Computer, Special Issue "Finding the Right Image" on Content-Based Image Retrieval Systems, 28(9), pp. 23–32, Sep. 1995. 9
[23] T. Gevers, A. W. M. Smeulders. "PicToSeek: a color image invariant retrieval system." In A. W. M. Smeulders, R. Jain, eds., Image Databases and Multi-Media Search, Series on Software Engineering and Knowledge Engineering, vol. 8, pp. 25–37, World Scientific, 1997. 19, 20
[24] U. Glavitsch, P. Schauble, M. Wechsler, “Metadata for integrating speech documents
in a text retrieval system,” Sigmod Record, Vol. 23, No. 4, Dec. 1994. 29
[25] G. H. Granlund, "Fourier preprocessing for hand print character recognition," IEEE Trans. Computers, Vol. C-21, 1972. 29
[26] A. Guttman, "R-trees: A dynamic index structure for spatial searching," Proc. ACM SIGMOD, Boston, pp. 47–57, June 1984. 63
[27] C. E. Jacobs, A. Finkelstein, D. H. Salesin. “Fast multiresolution image querying,”
In Proc. ACM SIGGRAPH 95, NY, 1995, pp. 278–280. 52
[28] A. E. Jacquin. Image coding based on a fractal theory of iterated contractive image
transformations. IEEE Trans. Image Proc., vol. 1, pp. 18–30, Jan. 1992. 12
[29] H. V. Jagadish, “Linear clustering of objects with multiple attributes,” Proc. ACM
SIGMOD, pp. 332–342, Atlantic City, May 1990. 63
[30] A. K. Jain, Fundamentals of Digital Image Processing, Computer Science Press,
Rockville, 1989. 30, 58
[31] I. Kamel, C. Faloutsos, “On packing R-trees,” Proc. CIKM, 2nd International Conf.
on Information Knowledge Management, Nov. 1993. 63
[32] S. Y. Lee, F. J. Hsu. “Spatial reasoning and similarity retrieval of image using 2D
C-String knowledge representation.” Pattern Recognition, 25(3), pp. 305–318, 1992.
9
[33] M. Nappi, G. Polese, G. Tortora. "FIRST: Fractal Indexing and Retrieval SysTem for image databases." Image and Vision Computing 16(14), pp. 1019–1031, Elsevier Science, Dec. 1998. 10
[34] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice-Hall, Engle-
wood Cliffs, NJ, USA, 1975. 38
[35] M. Otterman, “Approximate matching with high dimensionality R-trees,” M.Sc.
scholarly paper, Dept. of Computer Science, Univ. of Maryland, MD, USA, 1992.
[36] E. G. M. Petrakis, C. Faloutsos. “Similarity searching in medical image databases.”
IEEE Trans. Knowledge and Data Eng. 9(3), pp. 435–447, May/June 1997. 9, 58
[37] M. Polvere, M. Nappi. “Speed-up in fractal image coding: comparison of methods.”
IEEE Trans. Image Processing 9(6), pp. 1002–1009, June 2000. 10, 12
[38] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-
Hill, 1983. 29, 57
[39] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989.
13, 63
[40] D. Saupe, R. Hamzaoui, H. Hartenstein, “Fractal Image Compression–An Introduc-
tory Overview,” in Saupe D. and Hart J. (Eds): “Fractal Models For Image Synthesis,
Compression and Analysis”. SIGGRAPH 96 Course Notes, ACM, New Orleans, 1996.
3
[41] D. Saupe, S. Jacob, “Variance-based quadtrees in fractal image compression”, Elec-
tronics Letters 33,1, pp. 46–48, 1997. 4
[42] S. Sclaroff. “Distance to deformable prototypes: encoding shape categories for efficient
search.” In A. W. M. Smeulders, R. Jain, eds., Image Databases and Multi-Media
Search, Series on Software Engineering and Knowledge Engineering, vol. 8, 1997,
pp. 25–37, World Scientific. 20
[43] S. Sclaroff. Image database used in shape-based retrieval experiments available via
ftp at ftp://cs-ftp.bu.edu/sclaroff/pictures.tar.Z. 20
[44] G. Strang, T. Nguyen. Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, 1997. 52
[45] P. N. Topiwala. Wavelet Image and Video Compression. Kluwer Academic, 1998. 52
[46] J. D. Ullman. Principles of database and knowledge-based systems. Computer Science
Press, Rockville, MD, USA, 1988. 10, 63
[47] G. van de Wouwer, P. Scheunders, S. Livens and D. Van Dyck. "Wavelet correlation signatures for color texture characterization," Pattern Recognition 32, 1999, pp. 443–451. 52
[48] Various Authors, Finding the Right Image. IEEE Computer. Special Issue on Content
Based Image Retrieval Systems, 28(9), Sep. 1995.
[49] C. T. Zahn and R. Z. Roskies, “Fourier descriptors for plane closed curves,” IEEE
Trans. Computers, Vol. C21, 1972. 29, 40
Appendix A
Additional Details
A.1 Fractal Index Invariance to Contrast Scaling
Consider the original block b and the transformed block b′ = wb.
A.1.1 Center of mass
We can limit our attention to the x coordinate, since an identical line of reasoning applies to y. For b, we have
\[
x = \frac{1}{M} \sum_{\substack{1 \le i \le n \\ 1 \le j \le n}} i\, b_{i,j}.
\]
Obviously,
\[
M' = \sum_{\substack{1 \le i \le n \\ 1 \le j \le n}} b'_{i,j} = \sum_{\substack{1 \le i \le n \\ 1 \le j \le n}} w\, b_{i,j} = wM.
\]
The x coordinate of the center of mass of the transformed block b′ is then
\[
x' = \frac{1}{M'} \sum_{\substack{1 \le i \le n \\ 1 \le j \le n}} i\, b'_{i,j} = \frac{1}{wM} \sum_{\substack{1 \le i \le n \\ 1 \le j \le n}} i\, (w\, b_{i,j}) = x. \tag{A.1}
\]
This simple argument proves invariance of mass center position for the first iteration
(k = 0) of Eq. (3.5). Subsequent iterations (k > 0) can be handled as follows.
A.1.2 Higher Deviates
The following relation is easy to prove by induction:
\[
b'^{(k)}_{i,j} = w^{2^k}\, b^{(k)}_{i,j}. \tag{A.2}
\]
The case k = 0 has been shown in the previous discussion. For k > 0, assume Eq. (A.2) as the induction hypothesis and observe that it implies \mu'_k = w^{2^k} \mu_k. We then have
\[
b'^{(k+1)}_{i,j} = \bigl(b'^{(k)}_{i,j} - \mu'_k\bigr)^2 = \bigl(w^{2^k} b^{(k)}_{i,j} - w^{2^k} \mu_k\bigr)^2 = w^{2^{k+1}} \bigl(b^{(k)}_{i,j} - \mu_k\bigr)^2 = w^{2^{k+1}}\, b^{(k+1)}_{i,j}.
\]
In words, the higher deviates of the original and the transformed block are still proportional, albeit with a factor that depends on k. Therefore, we can apply Eq. (A.1) to b^{(k)} and b'^{(k)} and state that (x'_k, y'_k) = (x_k, y_k) for all k ≥ 0.
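The invariance just proved is easy to check numerically. The sketch below is illustrative only (1-based indices as in the derivation above); it compares the center of mass of a small block before and after contrast scaling with w = 0.5:

```python
def center_of_mass(block):
    """(x, y) center of mass of an n x n block, pixel values as masses."""
    mass = sum(sum(row) for row in block)
    x = sum(i * v for i, row in enumerate(block, 1) for v in row)
    y = sum(j * v for row in block for j, v in enumerate(row, 1))
    return (x / mass, y / mass)

b = [[1.0, 2.0], [3.0, 4.0]]
w = 0.5
b_scaled = [[w * v for v in row] for row in b]   # b' = w b
print(center_of_mass(b))          # -> (1.7, 1.6)
print(center_of_mass(b_scaled))   # -> (1.7, 1.6)
```

Scaling every pixel by w multiplies both the mass and the weighted sums by w, so the quotient is unchanged, exactly as in Eq. (A.1).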
A.2 Fractal Index Invariance to Luminance Shifting
Consider the original block b and the transformed block b′ = b+m1.
A.2.1 Center of Mass
We utilize the following notation: CM(b) indicates the center of mass of the n × n block b, while M(b) indicates its mass. The symbol 1 stands for the n × n block filled with 1's. The calculations are done in an orthogonal coordinate system centered at the geometric center of the block.
\[
\mathrm{CM}(b') = \mathrm{CM}(b + m\mathbf{1}) = \frac{M(b)\,\mathrm{CM}(b) + M(m\mathbf{1})\,\mathrm{CM}(m\mathbf{1})}{M(b) + M(m\mathbf{1})} = \frac{M(b)\,\mathrm{CM}(b) + m\,M(\mathbf{1})\,(0,0)}{M(b) + m\,M(\mathbf{1})} = \frac{M(b)}{M(b) + mn^2}\,\mathrm{CM}(b).
\]
Since CM(b′) is a scalar multiple of CM(b), their polar angles are equal.
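Numerically, the scalar-multiple relation implies that the polar angle of the center of mass is unchanged by a luminance shift. A small illustrative check (0-based indices recentered at the geometric center; function name is ours):

```python
import math

def center_of_mass_centered(block):
    """Center of mass in coordinates centered at the block's geometric center."""
    n = len(block)
    mass = sum(sum(row) for row in block)
    x = sum((i - (n - 1) / 2) * v for i, row in enumerate(block) for v in row)
    y = sum((j - (n - 1) / 2) * v for row in block for j, v in enumerate(row))
    return (x / mass, y / mass)

b = [[1.0, 2.0], [3.0, 4.0]]
m = 10.0
b_shifted = [[v + m for v in row] for row in b]   # b' = b + m 1

# Polar angle of the center of mass before and after the shift.
angle = math.atan2(*reversed(center_of_mass_centered(b)))
angle_shifted = math.atan2(*reversed(center_of_mass_centered(b_shifted)))
print(math.isclose(angle, angle_shifted))   # -> True
```

The magnitude of the center-of-mass vector shrinks (the block gets heavier), but its direction, and hence the polar angle used by the index, is preserved.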
Index
AVRR
definition, 20
Brodatz, 45
canonical form, 18
center of mass, 71, 72
color change, 15, 18
color reduction, 45
contrast scaling, 15, 16, 44, 45, 71
databases
homogeneous vs. heterogeneous, 9
deviates, 71
domains, 11
entropy
as a split-decision function, 5
Euclidean distance, 29
feature vectors, 12, 13
‘focus experiments’, 50
Fourier descriptors, 29, 38
fractal coding
basics, 11
Gaussian distribution, 31
Gaussian noise, 46, 55
Hierarchical Signature, 52
histogram, 13
image partition, 3
invariance
in indexing, 10, 17, 18, 41, 43
isometries, 19
in fractal coding, 13
JPEG, 45
luminance shifting, 15, 17, 44, 45, 72
mass center, see center of mass
moments, 30
multichannel images, see RGB, YIQ
Parseval’s theorem, 42
partition, 4, 13, 17
PicToSeek, 19, 22
quadtree, 3, 13, 17
ranges, 11
reflections, 16, 18, 41, 44, 46
RGB, 14
vs. YIQ, 12
RMS error, 4
rotations, 16, 18, 41, 44, 46
scaling (size), see zoom
sigma (σ), 33
spiral, 43
split decision function, 3
standard deviation, 32
threshold, 33
adaptive vs. fixed, 6
translations, 41, 44
variance
n-fold, 4
standard, 4
with fixed threshold, 4
wavelets, 52
YIQ, 14
vs. RGB, 12
zoom, 41, 44