Hallucinating optimal high-dimensional subspaces

Ognjen Arandjelović

Centre for Pattern Recognition and Data Analytics (PRaDA), School of Information Technology, Deakin University, Geelong 3216 VIC, Australia

Pattern Recognition 47 (2014) 2662–2672

Article info

Article history: Received 3 May 2013; Received in revised form 31 January 2014; Accepted 12 February 2014; Available online 20 February 2014.

Keywords: Projection; Ambiguity; Constraint; SVD; Similarity; Face

Abstract

Linear subspace representations of appearance variation are pervasive in computer vision. This paper addresses the problem of robustly matching such subspaces (computing the similarity between them) when they are used to describe the scope of variations within sets of images of different (possibly greatly so) scales. A naïve solution of projecting the low-scale subspace into the high-scale image space is described first and subsequently shown to be inadequate, especially at large scale discrepancies. A successful approach is proposed instead. It consists of (i) an interpolated projection of the low-scale subspace into the high-scale space, followed by (ii) a rotation of this initial estimate within the bounds of the imposed “downsampling constraint”. The optimal rotation is found in closed form as the one which best aligns the high-scale reconstruction of the low-scale subspace with the reference it is compared to. The method is evaluated on the problem of matching sets of (i) face appearances under varying illumination and (ii) object appearances under varying viewpoint, using two large data sets. In comparison to the naïve matching, the proposed algorithm is shown to greatly increase the separation of between-class and within-class similarities, as well as to produce far more meaningful modes of common appearance on which the match score is based.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

One of the most commonly encountered problems in computer vision is that of matching appearance. Whether it is images of local features [1], views of objects [2] or faces [3], textures [4] or rectified planar structures (buildings, paintings) [5], the task of comparing appearances is virtually unavoidable in a modern computer vision application. A particularly interesting and increasingly important instance of this task concerns the matching of sets of appearance images, each set containing examples of variation corresponding to a single class.

A ubiquitous representation of appearance variation within a class is by a linear subspace [6,7]. The most basic argument for the linear subspace representation can be made by observing that in practice the appearance of interest is constrained to a small part of the image space. Domain-specific information may restrict this even further, e.g. for Lambertian surfaces seen from a fixed viewpoint but under variable illumination [8–10], or smooth objects across changing pose [11,12]. Moreover, linear subspace models are also attractive for their low storage demands – they are inherently compact and can be learnt incrementally [13–18]. Indeed, throughout this paper it is assumed that the original data from which subspaces are estimated is not available.

A problem which arises when trying to match two subspaces – each representing certain appearance variation – and which has not as yet received due consideration in the literature is that of matching subspaces embedded in different image spaces, that is, corresponding to image sets of different scales. This is a frequent occurrence: an object one wishes to recognize may appear larger or smaller in an image depending on its distance, just as a face may, depending on the person's height and positioning relative to the camera. In most matching problems in the computer vision literature, this issue is overlooked. Here it is addressed in detail and it is shown that a naïve approach to normalizing for scale in subspaces results in inadequate matching performance. Thus, a method is proposed which, without any assumptions on the nature of the appearance that the subspaces represent, constructs an optimal hypothesis for a high-resolution reconstruction of the subspace corresponding to low-resolution data.

In the next section, a brief overview of the linear subspace representation is given first, followed by a description of the aforementioned naïve scale normalization. The proposed solution is described in this section as well. In Section 3 the two approaches are compared empirically and the results are analysed in detail. The main contribution and conclusions of the paper are summarized in Section 4.



2. Matching subspaces across scale

Consider a set X ⊂ R^d containing vectors which represent rasterized images:

X = \{ x_1, \ldots, x_N \} \qquad (1)

where d is the number of pixels in each image. It is assumed that all of the images represented by members of X have the same aspect ratio, so that the same indices of different vectors correspond spatially to the same pixel location. A common representation of the appearance variation described by X is by a linear subspace of dimension D, where usually it is the case that D ≪ d. If m_X is the estimate of the mean of the samples in X,

m_X = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad (2)

then B_X ∈ R^{d×D}, a matrix with columns consisting of orthonormal basis vectors spanning the D-dimensional linear subspace embedded in a d-dimensional image space, can be computed from the corresponding covariance matrix

C_X = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - m_X)(x_i - m_X)^T. \qquad (3)

Specifically, an insightful interpretation of B_X is as the row and column space basis of the best rank-D approximation to C_X:

B_X = \arg\min_{\substack{B \in \mathbb{R}^{d \times D} \\ B^T B = I}} \; \min_{\substack{\Lambda \in \mathbb{R}^{D \times D} \\ \Lambda_{ij} = 0,\ i \neq j}} \left\| C_X - B \Lambda B^T \right\|_F^2, \qquad (4)

where ‖·‖_F is the Frobenius norm of a matrix.
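To make Eqs. (1)–(4) concrete, the following is a minimal sketch (not the paper's code) of estimating B_X with NumPy; the function name and the one-image-per-column convention are illustrative assumptions. The minimizer of Eq. (4) is obtained directly from the leading left-singular vectors of the centred data, without ever forming the d×d covariance matrix:

```python
# A minimal sketch of Eqs. (1)-(4): estimating an orthonormal basis B_X of the
# D-dimensional appearance subspace from N rasterized images.
import numpy as np

def subspace_basis(X, D):
    """X: d x N matrix of rasterized images (one d-pixel image per column).
    Returns (m_X, B_X): the sample mean (Eq. (2)) and a d x D orthonormal basis."""
    m_X = X.mean(axis=1, keepdims=True)          # Eq. (2)
    A = (X - m_X) / np.sqrt(X.shape[1] - 1)      # chosen so that C_X = A @ A.T, Eq. (3)
    # The leading D left-singular vectors of A are the top-D eigenvectors of C_X,
    # i.e. the minimizer of Eq. (4).
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return m_X, U[:, :D]                         # B_X
```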

2.1. The “Naïve solution”

Let B_X ∈ R^{d_l×D} and B_Y ∈ R^{d_h×D} be two basis vector matrices corresponding to the appearance variations of image sets containing images with d_l and d_h pixels, respectively. Without loss of generality, let also d_l < d_h. As before, here it is assumed that all images, both within each set as well as across the two sets, are of the same aspect ratio. Thus, we wish to compute the similarity of the sets represented by the orthonormal basis matrices B_X and B_Y.

Subspaces spanned by the columns of B_X and B_Y cannot be compared directly as they are embedded in different image spaces. Instead, let us model the process of an isotropic downsampling of a d_h-pixel image down to d_l pixels with a linear projection realized through a projection matrix P ∈ R^{d_l×d_h}. In other words, for a low-resolution image set X ⊂ R^{d_l},

X = \{ x_1, \ldots, x_N \}, \qquad (5)

there is a high-resolution set X^* ⊂ R^{d_h}, such that

X^* = \{ x_i^* \;|\; x_i = P x_i^*,\ i = 1, \ldots, N \}. \qquad (6)

The form of the projection matrix depends on (i) the projection model employed (e.g. bilinear, bicubic, etc.) and (ii) the dimensions of the high and low scale images; see Fig. 1 for an illustration.
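The following sketch constructs one concrete instance of such a projection matrix. It uses a simple area-averaging (box filter) model rather than the paper's bilinear or bicubic kernels, and assumes an integer scale factor; other projection models would only change the row weights, not the structure:

```python
# A sketch of the projection matrix P of Eq. (6) under an area-averaging
# (box filter) downsampling model; an illustrative assumption, not the
# paper's bilinear/bicubic kernels.
import numpy as np

def box_projection_matrix(h_lo, h_hi):
    """Returns P of shape (h_lo**2, h_hi**2) mapping a rasterized h_hi x h_hi
    image to its h_lo x h_lo downsampled version (row-major rasterization)."""
    k = h_hi // h_lo
    assert h_lo * k == h_hi, "integer scale factor assumed in this sketch"
    P = np.zeros((h_lo * h_lo, h_hi * h_hi))
    for r in range(h_lo):
        for c in range(h_lo):
            for dr in range(k):
                for dc in range(k):
                    hi_idx = (r * k + dr) * h_hi + (c * k + dc)
                    # each low-resolution pixel averages a k x k high-resolution block
                    P[r * h_lo + c, hi_idx] = 1.0 / (k * k)
    return P

P = box_projection_matrix(5, 10)   # cf. Fig. 1: P is 25 x 100
```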

Under the assumption of a linear projection model, the least-square error reconstruction of the high-dimensional data can be achieved with a linear projection as well, in this case by P_R which can be computed as

P_R = P^T (P P^T)^{-1}. \qquad (7)

Since it is assumed that the original data from which B_X was estimated is not available, an estimate of the subspace corresponding to X^* can be computed by re-projecting each of the basis vectors (columns) of B_X into R^{d_h}:

\tilde{B}_X^* = P_R B_X. \qquad (8)

Note that in general \tilde{B}_X^* is not an orthonormal matrix, i.e. \tilde{B}_X^{*T} \tilde{B}_X^* ≠ I. Thus, after re-projecting the subspace basis, it is orthogonalized using the Householder transformation [19], producing a high-dimensional subspace basis estimate B_X^* which can be compared directly with B_Y.
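A minimal sketch of this naïve pipeline of Eqs. (7) and (8) follows; the function name is illustrative. NumPy's QR factorization is Householder-based, in line with the orthogonalization referenced above [19]:

```python
# A sketch of the naive solution of Section 2.1: least-squares re-projection
# of the low-scale basis followed by orthonormalization.
import numpy as np

def naive_reconstruction(B_X, P):
    """B_X: d_l x D low-scale basis; P: d_l x d_h projection matrix.
    Returns B*_X: a d_h x D orthonormal estimate of the high-scale basis."""
    P_R = P.T @ np.linalg.inv(P @ P.T)   # Eq. (7): right pseudo-inverse of P
    B_tilde = P_R @ B_X                  # Eq. (8): re-projected, not orthonormal
    Q, _ = np.linalg.qr(B_tilde)         # Householder-based orthonormalization
    return Q
```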

2.1.1. Limitations of the Naïve solution

The process of downsampling an image inherently causes a loss of information. In re-projecting the subspace basis vectors, information gaps are “filled in” through interpolation. This has the effect of constraining the spectrum of variation in the high-dimensional reconstructions to the bandwidth of the low-dimensional data. Compared to the genuine high-resolution images, the reconstructions are devoid of high frequency detail, which usually plays a crucial role in discriminative problems.

2.2. Proposed solution

We seek a constrained correction to the subspace basis B_X^*. To this end, consider a vector x_i^* in the high-dimensional image space R^{d_h} which, when downsampled, maps onto x_i in R^{d_l}. As before, this is modelled as a linear projection effected by a projection matrix P:

x_i = P x_i^*. \qquad (9)

Writing the reconstruction of x_i^*, computed as described in the previous section, as x_i^* + c_i, it has to hold that

x_i = P (x_i^* + c_i), \qquad (10)

or, equivalently,

0 = P c_i. \qquad (11)

Fig. 1. The projection matrix P ∈ R^{25×100} modelling the process of downsampling a 10×10 pixel image to 5×5 pixels, using (a) bilinear and (b) bicubic projection models, shown as an image.

Fig. 2. A conceptual illustration of the main idea: the initial reconstruction of the class subspace in the high-dimensional image space is refined through rotation within the constraints of the ambiguity constraint subspace.


In other words, the correction term c_i has to lie in the null space of P. Let B_c be a matrix of basis vectors spanning the null space which, given its meaning in the proposed framework, will henceforth be referred to as the ambiguity constraint subspace. Then the actual appearance in the high-dimensional image space corresponding to the subspace B_X ∈ R^{d_l×D} is not spanned by the D columns of B_X^*, but rather by some D orthogonal directions in the span of the columns of [B_X^* | B_c], as illustrated in Fig. 2.

Let B_{Xc} be a matrix of orthonormal basis vectors computed by orthogonalizing [B_X^* | B_c]:

B_{Xc} = \mathrm{orth}([B_X^* \,|\, B_c]). \qquad (12)

Then we seek a matrix T ∈ R^{(D + d_h − d_l)×D} which makes the optimal choice of D directions from the span of B_{Xc}:

B_{Xc} T = B_{Xc} [t_1 | t_2 | \ldots | t_D]. \qquad (13)

Here the optimal choice of T is defined as the one that best aligns the reconstructed subspace with the subspace it is compared with, i.e. B_Y. The matrix T can be constructed recursively, so let us consider how its first column t_1 can be computed. The optimal alignment criterion can be restated as

t_1 = \arg\max_{t_1'} \max_{a} \frac{(B_Y a) \cdot (B_{Xc} t_1')}{\|a\| \, \|t_1'\|}. \qquad (14)

Fig. 3. A summary of the proposed matching algorithm.

Fig. 4. (a) Illuminations 1–7 from the Cambridge face motion database (same pose and identity, different illuminations). (b) Five different individuals in illumination setting number 6 (same illumination, different pose relative to the sources of illumination). In spite of the same spatial arrangement of light sources, their effect on the appearance of faces changes significantly due to variations in people's heights and the ad lib chosen position relative to the camera.


Rewriting the right-hand side,

t_1 = \arg\max_{t_1'} \max_{a} \frac{(B_Y a) \cdot (B_{Xc} t_1')}{\|a\| \, \|t_1'\|} \qquad (15)

t_1 = \arg\max_{t_1'} \max_{a} \frac{a^T}{\|a\|} \, B_Y^T B_{Xc} \, \frac{t_1'}{\|t_1'\|} \qquad (16)

t_1 = \arg\max_{t_1'} \max_{a} \frac{a^T}{\|a\|} \, U \Sigma V^T \, \frac{t_1'}{\|t_1'\|}, \qquad (17)

where

B_Y^T B_{Xc} = \underbrace{\left[ u_1 | \ldots | u_D \right]}_{U} \; \underbrace{\begin{bmatrix} s_1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ 0 & 0 & \cdots & s_D & \cdots & 0 \end{bmatrix}}_{\Sigma} \; \underbrace{\left[ v_1 | \ldots | v_{D + d_h - d_l} \right]^T}_{V^T} \qquad (18)

is the Singular Value Decomposition (SVD) of B_Y^T B_{Xc} and s_1 ≥ s_2 ≥ ⋯ ≥ s_D. Then, from the right-hand side in Eq. (17), by inspection, the optimal directions of a and t_1' are, respectively, the first SVD “output” direction and the first SVD “input” direction, i.e. a = u_1 and t_1 = v_1. The same process can be used to infer the remaining columns of T, the i-th one being t_i = v_i.

Thus, the optimal reconstruction B_X' of B_X in the high-dimensional space, obtained by the constrained rotation of the naïve estimate B_X^*, is given by the orthonormal basis matrix

B_X' = B_{Xc} [v_1 | \ldots | v_D]. \qquad (19)

The key steps of the algorithm are summarized in Fig. 3.
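For concreteness, the following is a sketch of the complete procedure of Fig. 3, under the same illustrative naming as the earlier snippets; it assumes SciPy's null_space for the ambiguity constraint subspace of Eq. (11) and is not the author's implementation:

```python
# A sketch of the full proposed algorithm (Fig. 3, Eqs. (7)-(19)).
import numpy as np
from scipy.linalg import null_space

def hallucinate_and_match(B_X, B_Y, P):
    """B_X: d_l x D low-scale basis; B_Y: d_h x D reference high-scale basis;
    P: d_l x d_h projection matrix.
    Returns the corrected basis B'_X (Eq. (19)) and the similarity of Eq. (14)."""
    D = B_X.shape[1]
    # Steps 1-2: naive estimate B*_X (Section 2.1, Eqs. (7)-(8))
    P_R = P.T @ np.linalg.inv(P @ P.T)
    B_star, _ = np.linalg.qr(P_R @ B_X)
    # Step 3: ambiguity constraint subspace = null space of P (Eq. (11))
    B_c = null_space(P)                         # d_h x (d_h - d_l), orthonormal
    # Step 4: orthogonalize [B*_X | B_c] (Eq. (12))
    B_Xc, _ = np.linalg.qr(np.hstack([B_star, B_c]))
    # Steps 5-6: SVD of B_Y^T B_Xc, a D x (D + d_h - d_l) matrix (Eq. (18))
    U, s, Vt = np.linalg.svd(B_Y.T @ B_Xc, full_matrices=False)
    # Step 7: rotate within the span of B_Xc onto the first D "input" directions
    B_prime = B_Xc @ Vt.T                       # Eq. (19): columns v_1 ... v_D
    return B_prime, s[0]                        # s_1 is the similarity of Eq. (14)
```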

2.2.1. Computational requirements and implementation issues

Before turning our attention to the empirical analysis of the proposed algorithm, let us briefly highlight the low additional computational load imposed by the refinement of the reconstructed class subspace in the high-dimensional image space. Specifically, note that the output of Steps 1 and 3 in Fig. 3 can be pre-computed, as it is dependent only on the dimensions of the low and high scale data, not on the data itself. Orthogonalization in Step 2 is fast, as D – the number of columns in B_X – is small. Although at first sight more complex, the orthogonalization in Step 4 is also not demanding, as B_c is already orthonormal, so it is in fact only the D columns of B_X^* which need to be adjusted. Lastly, the Singular Value Decomposition in Step 6 operates on a matrix which has a high “landscape” eccentricity, so the first D “input” directions can be computed rapidly, while Step 7 consists only of a simple matrix multiplication.

3. Experimental analysis

The theoretical ideas put forward in the preceding sections were evaluated empirically on two popular problems in computer vision: matching sets of images of (i) face appearances and (ii) object appearances. For this, two large data sets were used:

• The Cambridge Face Motion Database [20,21],¹ and
• The Amsterdam Library of Object Images [22].²

Their contents are reviewed next.

Fig. 5. Examples of 10 roughly angularly equidistant views (out of the 73 available) of two objects from the “Object View Collection” subset of the Amsterdam Library of Object Images [22]. (a) Object 0001 – toy bear and (b) Object 0002 – keys on a chain.

¹ Also see http://mi.eng.cam.ac.uk/~oa214/.
² Also see www.science.uva.nl/~aloi/.


3.1. Data

For a thorough description of the two data sets used, the reader should consult the previous publications in which they are described in detail [21,22]. Here they are briefly summarized for the sake of clarity and completeness of the present analysis.

3.1.1. Cambridge face motion database

The Cambridge Face data set is a database of face motion video sequences acquired in the Department of Engineering, University of Cambridge. It contains 100 individuals of varying age, ethnicity and sex. Seven different illumination configurations were used for the acquisition of data; these are illustrated in Fig. 4. For every person enrolled in the database, 2 video sequences of the person performing pseudo-random motion were collected in each illumination. The individuals were instructed to approach the camera, thus choosing their positioning ad lib, and freely perform head and/or body motion relative to the camera while real-time visual feedback was provided on a screen placed above the camera. Most sequences contain significant yaw and pitch variation, some translatory motion and negligible roll. Mild facial expression changes are present in some sequences (e.g. when the user was smiling or talking to the person supervising the acquisition).

3.1.2. Amsterdam Library of Object Images

The Amsterdam Library of Object Images is a collection of images of 1000 small objects [22]. Examples of two objects are shown in Fig. 5. The data set comprises three main subsets: (i) the “Illumination Direction Collection”, (ii) the “Illumination Colour Collection” and (iii) the “Object View Collection”. In the “Illumination Direction Collection” the camera viewpoint relative to each object was kept constant, while the illumination direction was varied. Similarly, in the “Illumination Colour Collection”, images corresponding to different voltages of a variable voltage halogen illumination source were acquired from a fixed viewpoint. Finally, the “Object View Collection” contains views of objects under a constant illumination but variable pose. These were acquired using 5° increments of the object's rotation around an axis parallel to the image plane. This collection was used in the evaluation reported here. Fig. 5 shows a subset of 10 images (out of the total number of 360/5 + 1 = 73) which illustrate the nature of the data variability. Further details can be obtained by consulting the original publication [22] and from the web site of the database: www.science.uva.nl/~aloi/.

3.2. Evaluation protocol

In the case of both data sets, evaluation was performed by matching high resolution with low resolution class models.

Fig. 6. Different scales used as low resolution matching input for (a) face and (b) object data. Square face images with widths of 5–25 pixels at 5 pixel increments were considered. Images from the Amsterdam Library of Object Images were sub-sampled to 5–25% of the original size, at 5% increments.


A single class was taken to correspond to a particular person or object when, respectively, face and object appearances were matched. High resolution linear subspace models were computed using 50×50 pixel face data and 192×144 pixel object images, as described in Section 2. Low resolution subspaces were constructed using downsampled data. Square face images were downsampled to five different scales: 5×5 pixels, 10×10 pixels, 15×15 pixels, 20×20 pixels and 25×25 pixels, as shown in Fig. 6(a). Data from the Amsterdam Library of Object Images was downsampled, also to five different scales, corresponding to 5%, 10%, 15%, 20% and 25% of its linear scale (e.g. height, while maintaining the original aspect ratio), as shown in Fig. 6(b).

Training was performed by constructing class models with downsampled face images in a single illumination setting in the case of face appearance matching, and with downsampled object images using half of the available data in the case of object appearance matching. Thus each class represented by a linear subspace corresponds either to a single person, capturing his/her appearance in the training illumination, or to a single object, using a limited set of views.

In querying an algorithm using a novel subspace, the subspace was classified into the class of the highest similarity. The similarity between two subspaces was expressed by a number in the range [0, 1], equal to the correlation of the two highest correlated vectors confined to them, as per Eq. (14) in the previous section.
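A sketch of this matching rule follows (helper names are illustrative): the similarity of Eq. (14) is the largest singular value of the product of the two orthonormal basis matrices, i.e. the first canonical correlation, and classification is by the maximum over enrolled classes:

```python
# A sketch of the classification protocol of Section 3.2.
import numpy as np

def subspace_similarity(B_A, B_B):
    """First canonical correlation between two orthonormal bases embedded in
    the same image space: a similarity in [0, 1], as per Eq. (14)."""
    return np.linalg.svd(B_A.T @ B_B, compute_uv=False)[0]

def classify(B_query, enrolled_bases):
    """Assign the query subspace to the enrolled class of highest similarity."""
    sims = [subspace_similarity(B_query, B) for B in enrolled_bases]
    return int(np.argmax(sims)), float(max(sims))
```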

3.3. Results

First, the effects of the method proposed in Section 2.2 on class separation were examined and compared to those of the naïve method of Section 2.1. This was quantified as follows. For a given pair of training and “query” illumination conditions, the similarity ρ_{i,j} between all image sets i acquired in the training illumination and all sets j acquired in the query illumination was evaluated.

Fig. 7. The increase in class separation μ (the ordinate; note that the scale is logarithmic) over different training–query illumination conditions (abscissa) achieved by the proposed method in comparison to the naïve subspace re-projection approach. Note that for clarity the training–query illumination pairs were ordered in increasing order of improvement for each plot; thus, the indices of different abscissae do not necessarily correspond. (a) (5×5) ↔ (50×50), (b) (10×10) ↔ (50×50), (c) (15×15) ↔ (50×50), (d) (20×20) ↔ (50×50).

Fig. 8. Typical similarity matrices resulting from the naïve (a) and the proposed (b) matching approaches. Our method produces a dramatic improvement in class separation, as witnessed by the increased dominance of the diagonal elements in the aforementioned matrix.


Thus, the mean confidences e_w and e_b of, respectively, the correct and incorrect matching assignments are given by

e_w = 1 - \frac{1}{M} \sum_{i=1}^{M} \rho_{i,i} \qquad (20)

e_b = 1 - \frac{1}{M (M-1)} \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \neq i}}^{M} \rho_{i,j}, \qquad (21)

where M is the number of distinct classes. The corresponding separation is then proportional to e_b and inversely proportional to e_w:

\mu = e_b \, e_w^{-1}. \qquad (22)
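The statistic is straightforward to compute from a similarity matrix such as those of Fig. 8; the following sketch (illustrative naming) implements Eqs. (20)–(22):

```python
# A sketch of the class separation statistic, Eqs. (20)-(22), computed from an
# M x M similarity matrix rho whose (i, j) entry compares training class i
# with query class j (classes aligned along the diagonal, as in Fig. 8).
import numpy as np

def class_separation(rho):
    M = rho.shape[0]
    e_w = 1.0 - np.trace(rho) / M                      # Eq. (20)
    off_diag = rho.sum() - np.trace(rho)
    e_b = 1.0 - off_diag / (M * (M - 1))               # Eq. (21)
    return e_b / e_w                                   # mu, Eq. (22)
```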

The separation was evaluated separately for all training–query illumination pairs in the Cambridge Face Database using the naïve method and compared with that of the proposed solution across different matching scales using the bicubic projection model. A plot of the results is shown in Fig. 7, in which for clarity the training–query illumination pairs were ordered in increasing order of improvement for each plot (thus the indices of different abscissae do not necessarily correspond).

Firstly, note that an improvement was observed for all illumination combinations at all scales. Unsurprisingly, the most significant increase in class separation (an approximately 8.5-fold mean increase) was achieved for the most drastic difference between training and query sets, when subspaces embedded in a 25-dimensional image space – representing the appearance variation of images as small as 5×5 pixels, see Fig. 6(a) – were matched against subspaces embedded in an image space of 100 times greater dimensionality.

It is interesting to note that even at the more favourable scales of the low resolution input, although the mean improvement was less noticeable than at extreme scale discrepancies, the accuracy of matching in certain combinations of illumination settings still greatly benefited from the proposed method. For example, for low resolution subspaces representing appearance in 10×10 pixel images, a mean separation increase of 75.6% was measured; yet, for illuminations “1” and “2” – corresponding to index 42 on the abscissa in Fig. 7(b) – the improvement was 473.0%. The change effected on the inter-class and intra-class distances is illustrated in Fig. 8, which shows a typical similarity matrix produced by the naïve and the proposed matching methods.

The mean separation increase across different scales for face and object data is shown in, respectively, Figs. 9 and 10. These also illustrate the impact that the projection model used has on the quality of the matching results, Figs. 9(a) and 10(a) corresponding to the bilinear projection model, and Figs. 9(b) and 10(b) to the bicubic.

Fig. 9. Mean class separation increase achieved as a function of the size of the low-scale images. Shown is the ratio of class separation when subspaces are matched using the proposed method and the naïve re-projection method described in Section 2.1. The rate of improvement decay is incrementally exponential, reaching 1 (no improvement) when d_l = d_h. (a) Faces – bilinear projection model and (b) faces – bicubic projection model.

Fig. 10. Mean class separation increase achieved as a function of the size of the low-scale images. Shown is the ratio of class separation when subspaces are matched using the proposed method and the naïve re-projection method. Unlike in the plots obtained from the face matching experiments in Fig. 9, the nature of variation across different scales here appears less regular. It is likely that the reason lies in the large (and variable in shape) area of the background present in the object images. (a) Objects – bilinear projection model and (b) objects – bicubic projection model.


Fig. 11. Bicubic projection model – the inferred most similar modes of variation contained within two subspaces representing the face appearance variation of the same person in different illumination conditions and at different training scales. In each subfigure, which corresponds to a different training–query scale discrepancy, the top pair of images represents appearance extracted by the naïve algorithm of Section 2.1 (as the left-singular and right-singular vectors of B_Y^T B_X^*); the bottom pair is extracted by the proposed method (as the left-singular and right-singular vectors of B_Y^T B_{Xc}). (a) (5×5) ↔ (50×50), (b) (10×10) ↔ (50×50), (c) (15×15) ↔ (50×50), (d) (20×20) ↔ (50×50), (e) (25×25) ↔ (50×50).


Fig. 12. Bilinear projection model – the inferred most similar modes of variation contained within two subspaces representing the face appearance variation of the same person in different illumination conditions and at different training scales. In each subfigure, which corresponds to a different training–query scale discrepancy, the top pair of images represents appearance extracted by the naïve algorithm of Section 2.1 (as the left-singular and right-singular vectors of B_Y^T B_X^*); the bottom pair is extracted by the proposed method (as the left-singular and right-singular vectors of B_Y^T B_{Xc}). (a) (5×5) ↔ (50×50), (b) (10×10) ↔ (50×50), (c) (15×15) ↔ (50×50), (d) (20×20) ↔ (50×50), (e) (25×25) ↔ (50×50).


As could be expected from theory, the latter was found to be consistently superior across all scales and for both data sets. In the case of face appearance, the greatest improvement over the naïve re-projection method was observed for the smallest scale of low resolution data – an 8.5-fold separation increase was achieved for 5×5 pixel images, 1.75-fold for 10×10, 1.25-fold for 15×15, 1.08-fold for 20×20 and 1.03-fold for 25×25. It is interesting to note that the relative performance across different scales of low resolution object data did not follow the same functional form as in the case of face data. A possible reason for this seemingly odd result lies in the presence of confounding background regions (unlike in the face data set, which was automatically cropped to include foreground information only). Not only does the background typically occupy a significant area of object images, but it is also of variable shape across different views of the same object, as well as across different objects. It is likely that the interaction of this confounding factor with the downsampling scale is the cause of the less predictable nature of the plots in Fig. 10.

The inferred most similar modes of variation contained within two subspaces representing the face appearance variation of the same person in different illumination conditions and at different training scales, for the bicubic and bilinear models respectively, are shown in Figs. 11 and 12. In both cases, as the scale of the low-resolution images is reduced, the naïve algorithm of Section 2.1 finds progressively worse matching modes, with significant visual degradation in the mode corresponding to the low-resolution subspace. In contrast, the proposed algorithm correctly reconstructs meaningful high-resolution appearance even in the case of extremely low resolution images (5×5 pixels).

Lastly, we examined the behaviour of the proposed method in the presence of data corruption by noise. Specifically, we repeated the previously described experiments for the bilinear projection model, with the difference that following the downsampling of high resolution images we added pixel-wise Gaussian noise to the resulting low resolution images before creating the corresponding low-dimensional subspaces. Since the pixel-wise characteristics of the noise were the same across all pixels in a specific experiment, this noise is isotropic in the low-dimensional image space. The sensitivity of the proposed method was evaluated by varying the magnitude of the noise added in this manner. In particular, we started by adding noise with a pixel-wise standard deviation of 1 (i.e. an image space root mean square equal to √d_l), or approximately 0.4% of the entire greyscale spanning the range from 0 to 255, and progressively increased it up to 30 (i.e. an image space root mean square equal to 30√d_l), or approximately 12% of the possible pixel value range, which corresponds to an average signal-to-noise ratio of only 1.7. The results are summarized in the plot in Fig. 13, which shows the change in class separation for different levels of additive noise. Note that for the sake of easier visualization in a single plot, the change is shown relative to the separation attained using un-corrupted images, discussed previously and plotted in Fig. 9(a). It is remarkable to observe that even in the most challenging experiment, when the magnitude of the added noise is extreme, the performance of the proposed method is hardly affected at all. In all cases, including that when matching is performed using the low-dimensional subspaces with the greatest downsampling factor, the average class separation is not decreased by more than 1.5%. For pixel-wise noise magnitudes of up to 20 greyscale levels, the deterioration is consistently lower than 0.5%, and even for the pixel-wise noise magnitude of 30 greyscale levels a separation decrease of more than 1% is observed in only two instances (for low-dimensional spaces corresponding to images downsampled to 10×10 and 15×15 pixels). Note that this means that even when the proposed method performs matching in the presence of extreme noise, its performance exceeds that of the naïve approach applied to un-corrupted data.

Fig. 13. The effects of additive zero-mean Gaussian noise, isotropic in the image space, applied to low resolution images before the construction of the corresponding subspaces. Shown is the change in the observed class separation which is, for the sake of visualization clarity, measured relative to the separation achieved using the original, un-corrupted images; see Fig. 9(a). The results are for the bilinear projection model. Notice the remarkable robustness of the proposed model: even for noise with a pixel-wise standard deviation of 30 greyscale levels (approximately 12% of the entire greyscale intensity range), which corresponds to an average signal-to-noise ratio of 1.7, class separation is decreased by less than 1.5%. Note that this means that even when the proposed method performs matching in the presence of extreme noise, its performance exceeds that of the naïve approach applied to un-corrupted data.
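The noise model is easy to reproduce; the sketch below (illustrative names, assuming images held as d_l-dimensional columns) adds the isotropic noise and prints the greyscale fractions quoted above:

```python
# A sketch of the noise model in the robustness experiment: zero-mean isotropic
# Gaussian noise added pixel-wise to the downsampled images. With per-pixel
# standard deviation sigma, the expected image-space RMS of the noise vector
# is sigma * sqrt(d_l), as quoted in the text.
import numpy as np

rng = np.random.default_rng(0)

def add_isotropic_noise(X_lo, sigma):
    """X_lo: d_l x N matrix of low-resolution images; sigma in greyscale levels."""
    return X_lo + rng.normal(0.0, sigma, size=X_lo.shape)

for sigma in (1, 30):  # the smallest and largest magnitudes used in Section 3.3
    print(f"sigma={sigma:2d}: {100 * sigma / 255:.1f}% of the 0-255 greyscale range")
```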

4. Conclusion

In this paper a method for matching linear subspaces which represent appearance variations in images of different scales was described. The approach consists of an initial re-projection of the subspace in the low-dimensional image space to the high-dimensional one, and a subsequent refinement of the re-projection through a constrained rotation. Using facial and object appearance images and the corresponding two large data sets, it was shown that the proposed algorithm successfully reconstructs the personal subspace in the high-dimensional image space even for low-dimensional input corresponding to images as small as 5×5 pixels, improving average class separation by an order of magnitude. Our immediate future work will be in the direction of integrating the proposed method with the discriminative framework recently described in [23].

Conflict of interest

None declared.

Acknowledgements

The author would like to thank Trinity College Cambridge for their kind support and the volunteers from the University of Cambridge Department of Engineering whose face data was included in the database used in developing the algorithm described in this paper.

References

[1] V. Ferrari, T. Tuytelaars, L. Van Gool, Retrieving objects from videos based on affine regions, in: Proceedings of the European Signal Processing Conference (EUSIPCO), 2004, pp. 128–131.



[2] M. Everingham, A. Zisserman, C. Williams, L. Van Gool, et al., The 2005 PASCAL visual object classes challenge, in: Selected Proceedings of the 1st PASCAL Challenges Workshop, 2006.

[3] Y. Su, S. Shan, X. Chen, W. Gao, Hierarchical ensemble of global and local classifiers for face recognition, IEEE Trans. Image Process. 18 (8) (2009) 1885–1896.

[4] R. Pradhan, Z.G. Bhutia, M. Nasipuri, M.P. Pradhan, Gradient and principal component analysis based texture recognition system: a comparative study, in: 5th International Conference on Information Technology: New Generations, 2009, pp. 1222–1223.

[5] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., 2004.

[6] P. Chen, D. Suter, An analysis of linear subspace approaches for computer vision and pattern recognition, Int. J. Comput. Vis. 68 (1) (2006) 83–106.

[7] M. Bethge, Factorial coding of natural images: how effective are linear models in removing higher-order dependencies? J. Opt. Soc. Am. 23 (6) (2006) 1253–1268.

[8] P.N. Belhumeur, D.J. Kriegman, What is the set of images of an object under all possible illumination conditions? Int. J. Comput. Vis. 28 (3) (1998) 245–260.

[9] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.

[10] R. Basri, D.W. Jacobs, Lambertian reflectance and linear subspaces, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2) (2003) 218–233.

[11] K. Lee, M. Ho, J. Yang, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684–698.

[12] S. Zhou, G. Aggarwal, R. Chellappa, D. Jacobs, Appearance characterization of linear Lambertian objects, generalized photometric stereo, and illumination-invariant face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2) (2007) 230–245.

[13] D. Skočaj, A. Leonardis, Weighted and robust incremental method for subspace learning, Image and Vision Computing (IVC), 2008, pp. 27–38.

[14] M. Song, H. Wang, Highly efficient incremental estimation of Gaussian mixture models for online data stream clustering, in: Proceedings of the SPIE Conference on Intelligent Computing: Theory and Applications, 2005.

[15] O. Arandjelović, R. Cipolla, Incremental learning of temporally-coherent Gaussian mixture models, in: Proceedings of the British Machine Vision Conference (BMVC), vol. 2, 2005, pp. 759–768.

[16] J.J. Verbeek, N. Vlassis, B. Kröse, Efficient greedy learning of Gaussian mixture models, Neural Comput. 15 (2) (2003) 469–485.

[17] P. Hall, D. Marshall, R. Martin, Merging and splitting eigenspace models, IEEE Trans. Pattern Anal. Mach. Intell. 22 (9) (2000) 1042–1048.

[18] R. Gross, J. Yang, A. Waibel, Growing Gaussian mixture models for pose invariant face recognition, in: Proceedings of the IAPR International Conference on Pattern Recognition (ICPR), vol. 1, 2000, pp. 1088–1091.

[19] A.S. Householder, Unitary triangularization of a nonsymmetric matrix, J. ACM 5 (4) (1958) 339–342.

[20] O. Arandjelović, Recognition from appearance subspaces across image sets of variable scale, in: Proceedings of the British Machine Vision Conference (BMVC), 2010, http://dx.doi.org/10.5244/C.24.79.

[21] O. Arandjelović, Computationally efficient application of the generic shape-illumination invariant to face recognition from video, Pattern Recognit. 45 (1) (2012) 92–103.

[22] J.M. Geusebroek, G.J. Burghouts, A.W.M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vis. 61 (1) (2005) 103–112.

[23] O. Arandjelović, Discriminative extended canonical correlation analysis for pattern set matching, Mach. Learn. 94 (3) (2013) 353–370.

Ognjen Arandjelović graduated top of his class from the Department of Engineering Science at the University of Oxford (M.E.). In 2007 he was awarded the Ph.D. degree from the University of Cambridge. After spending 4 years as a Fellow of Trinity College Cambridge, he moved to Swansea University as a Lecturer in Visual Computing. Currently he is a Senior Lecturer in Pattern Recognition and Data Analytics at Deakin University; he also holds the title of Associate Professor at Université Laval. His main research interests are computer vision and machine learning, and their applications in various fields of science. He is a Fellow of the Cambridge Overseas Trust and a winner of multiple best research paper awards.
