9
Sparse representations for spatial prediction and texture refinement Aurélie Martin a,b,, Jean-Jacques Fuchs a , Christine Guillemot a , Dominique Thoreau b a IRISA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex, France b Technicolor Research and Innovation labs, 1 avenue Belle Fontaine BP 19, 35510 Cesson-Sévigné Cedex, France article info Article history: Received 12 July 2010 Accepted 18 January 2011 Available online 26 January 2011 Keywords: Spatial prediction H.264/AVC Sparse representations Extrapolation Phase refinement Video compression Scalable coding Redundant dictionaries abstract In this work, we propose a novel approach for signal prediction based on the use of sparse signal repre- sentations and Matching Pursuit (MP) techniques. The paper first focuses on spatial texture prediction in a conventional block-based hybrid coding scheme and secondly addresses inter-layer prediction in a scal- able video coding (SVC) framework. For spatial prediction the signal reconstruction of the block to predict is based on basis functions selected with the MP iterative algorithm, to best match a causal neighborhood. Inter-layer MP based prediction employs base layer upsampled components additionally to the causal neighborhood in order to improve the representation of high frequencies. New solutions are proposed for efficiently deriving and exploiting the atoms dictionary through phase refinement and mono-dimen- sional basis functions. Experimental results indicate noticeable improvement of rate/distortion perfor- mance compared to the standard prediction methods as specified in H.264/AVC and its extension SVC. Ó 2011 Elsevier Inc. All rights reserved. 1. Introduction Digital compression has become an essential tool for transmis- sion and storage of increasingly large multimedia content. To meet these needs, the current standard for video compression, H.264/ AVC, is based on a predictive encoding to reduce the amount of information transmitted. An image prediction is generated, and then subtracted to the original to form a residual image containing little information. In particular, H.264/AVC intra prediction is based on the spread of neighboring pixels, along some predefined directions. Although very effective to extend pattern with the same unidirectional char- acteristics, this prediction has limited performance to extrapolate complex bi-dimensional signals. Similarly, the scalable extension of H.264/AVC (SVC) uses inter- layer prediction to reduce redundancies between successive spatial layers. The reconstructed base layer signal is upsampled to predict its higher spatial layer signal, using 4-taps FIR filters. This simple predictive approach can be significantly improved using upsam- pling techniques exploiting better the signal properties. The goal of sparse approximation techniques is to look for a linear expansion approximating the analyzed signal in terms of functions chosen from a large and redundant set. The Matching Pursuit (MP) algorithm has been applied to low bitrate video coding in [1–3]. Motion residual images are decomposed into a weighted summation of elements from a large dictionary of bi- dimensional Gabor structures. MP has also been applied to signal extension using Cosine and wavelet basis functions [4]. Sparse rep- resentations have also been applied to solve extrapolation and refinement issues. In [5] the authors propose a sparse image decomposition technique for image inpainting and denoising in one unified task. The image sparse approximation quality directly depends on the choice of basis functions. An algorithm for design- ing adaptive dictionaries has been proposed in [6] to form basis functions highly dependent on image structures. In this paper, we suggest the use of sparse representations for signal extrapola- tion and texture refinement for spatial predictions. We address both the spatial and inter-layer prediction issues of H.264/AVC and SVC previously pointed out. In order to better extrapolate complex bi-dimensional signals, this work offers new prediction tools based on sparse representations. Management and derivation of the sparse representation atoms dictionary are also studied and related new solutions are proposed. Performance of the proposed approaches are shown in a compression scheme based on H.264/AVC and SVC standards, with gains up to 10 % for intra prediction and about 3 % for inter-layer prediction. Notice- able visual improvements can also be observed. The efficiency of such predictions is based on the ability of ba- sis functions to properly extend textured signals of various kinds. Accordingly to this, we have also explored solutions to create panels of basis functions adapted for the textured areas prediction. 1047-3203/$ - see front matter Ó 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2011.01.003 Corresponding author at: IRISA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex, France. E-mail address: [email protected] (A. Martin). J. Vis. Commun. Image R. 22 (2011) 712–720 Contents lists available at ScienceDirect J. Vis. Commun. Image R. journal homepage: www.elsevier.com/locate/jvci

Sparse representations for spatial prediction and texture refinement

Embed Size (px)

Citation preview

Page 1: Sparse representations for spatial prediction and texture refinement

J. Vis. Commun. Image R. 22 (2011) 712–720

Contents lists available at ScienceDirect

J. Vis. Commun. Image R.

journal homepage: www.elsevier .com/ locate / jvc i

Sparse representations for spatial prediction and texture refinement

Aurélie Martin a,b,⇑, Jean-Jacques Fuchs a, Christine Guillemot a, Dominique Thoreau b

a IRISA/Université de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex, Franceb Technicolor Research and Innovation labs, 1 avenue Belle Fontaine BP 19, 35510 Cesson-Sévigné Cedex, France

a r t i c l e i n f o

Article history:Received 12 July 2010Accepted 18 January 2011Available online 26 January 2011

Keywords:Spatial predictionH.264/AVCSparse representationsExtrapolationPhase refinementVideo compressionScalable codingRedundant dictionaries

1047-3203/$ - see front matter � 2011 Elsevier Inc. Adoi:10.1016/j.jvcir.2011.01.003

⇑ Corresponding author at: IRISA/Université de Re35042 Rennes Cedex, France.

E-mail address: [email protected] (A. Martin

a b s t r a c t

In this work, we propose a novel approach for signal prediction based on the use of sparse signal repre-sentations and Matching Pursuit (MP) techniques. The paper first focuses on spatial texture prediction ina conventional block-based hybrid coding scheme and secondly addresses inter-layer prediction in a scal-able video coding (SVC) framework. For spatial prediction the signal reconstruction of the block to predictis based on basis functions selected with the MP iterative algorithm, to best match a causal neighborhood.Inter-layer MP based prediction employs base layer upsampled components additionally to the causalneighborhood in order to improve the representation of high frequencies. New solutions are proposedfor efficiently deriving and exploiting the atoms dictionary through phase refinement and mono-dimen-sional basis functions. Experimental results indicate noticeable improvement of rate/distortion perfor-mance compared to the standard prediction methods as specified in H.264/AVC and its extension SVC.

� 2011 Elsevier Inc. All rights reserved.

1. Introduction

Digital compression has become an essential tool for transmis-sion and storage of increasingly large multimedia content. To meetthese needs, the current standard for video compression, H.264/AVC, is based on a predictive encoding to reduce the amount ofinformation transmitted. An image prediction is generated, andthen subtracted to the original to form a residual image containinglittle information.

In particular, H.264/AVC intra prediction is based on the spreadof neighboring pixels, along some predefined directions. Althoughvery effective to extend pattern with the same unidirectional char-acteristics, this prediction has limited performance to extrapolatecomplex bi-dimensional signals.

Similarly, the scalable extension of H.264/AVC (SVC) uses inter-layer prediction to reduce redundancies between successive spatiallayers. The reconstructed base layer signal is upsampled to predictits higher spatial layer signal, using 4-taps FIR filters. This simplepredictive approach can be significantly improved using upsam-pling techniques exploiting better the signal properties.

The goal of sparse approximation techniques is to look for alinear expansion approximating the analyzed signal in terms offunctions chosen from a large and redundant set. The MatchingPursuit (MP) algorithm has been applied to low bitrate video

ll rights reserved.

nnes 1, Campus de Beaulieu,

).

coding in [1–3]. Motion residual images are decomposed into aweighted summation of elements from a large dictionary of bi-dimensional Gabor structures. MP has also been applied to signalextension using Cosine and wavelet basis functions [4]. Sparse rep-resentations have also been applied to solve extrapolation andrefinement issues. In [5] the authors propose a sparse imagedecomposition technique for image inpainting and denoising inone unified task. The image sparse approximation quality directlydepends on the choice of basis functions. An algorithm for design-ing adaptive dictionaries has been proposed in [6] to form basisfunctions highly dependent on image structures. In this paper,we suggest the use of sparse representations for signal extrapola-tion and texture refinement for spatial predictions.

We address both the spatial and inter-layer prediction issues ofH.264/AVC and SVC previously pointed out. In order to betterextrapolate complex bi-dimensional signals, this work offers newprediction tools based on sparse representations. Managementand derivation of the sparse representation atoms dictionary arealso studied and related new solutions are proposed. Performanceof the proposed approaches are shown in a compression schemebased on H.264/AVC and SVC standards, with gains up to 10 %for intra prediction and about 3 % for inter-layer prediction. Notice-able visual improvements can also be observed.

The efficiency of such predictions is based on the ability of ba-sis functions to properly extend textured signals of various kinds.Accordingly to this, we have also explored solutions to createpanels of basis functions adapted for the textured areasprediction.

Page 2: Sparse representations for spatial prediction and texture refinement

A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720 713

The remainder of the article is organized as follows. The conceptof sparse representations is first recalled in Section 2. The adapta-tion to the prediction problem is presented in Section 3. The appli-cation consisting in a refinement of inter–intra layer in the SVCscheme is explained in Section 4. In Section 5, dictionary issuesfor texture extrapolation are approached. Results are presentedin Section 6 and conclusion in Section 7.

2. Sparse representations

2.1. Introduction

Let y be a vector of dimension n and A a matrix of dimensionn �m with m� n. The columns ak of A can be seen as basis func-tions or atoms of a dictionary that will be used to represent thevector y. Note that there are infinite numbers of ways to choosethe m-dimensional vector x such that y = Ax. The aim of sparse rep-resentations is to search among all these solutions of y = Ax thosethat are sparse, i.e. those for which the vector x has only a smallnumber of nonzero components. Indeed one quite generally doesnot seek an exact reconstruction but rather seeks a sparse repre-sentation that satisfies:

ky� Axk22 6 q;

where q characterizes an admissible reconstruction error. Sincesearching for the sparsest representation satisfying this constraintis NP-hard and hence computationally intractable, one seeksapproximate solutions.

2.2. Matching Pursuit algorithm (MP)

The MP algorithm [7] offers a sub-optimal solution to this prob-lem via an iterative algorithm. It generates a sequence of m-dimen-sional vectors xk having an increasing number of non zerocomponents in the following way.

At the first iteration x0 = 0 and an initial residual vectorr0 = y � Ax0 = y is computed. At iteration k, the algorithm selectsthe basis function ajk having the highest correlation with the cur-rent residual vector rk�1 = y � Axk�1, that is, such that:

jk ¼ arg maxj

jaTj rk�1jaT

j aj:

The weight xjk of this new atom is then chosen so as to minimize theenergy of the new residual vector, which becomes thus equal to:

rk ¼ rk�1 �aT

j rk�1

aTj aj

ajk :

The new optimal weight is introduced into xk�1 to yield xk. Note thatthe same atom may be chosen several times by MP. In this case, thevalue of the coefficient is added to the previous one. The algorithmproceeds until the stopping criterion:ky� Axkk2

6 q ð1Þ

is satisfied, where q is a tolerance parameter which controls thesparseness of the representation.

2.3. Global Matched Filter (GMF)

The GMF introduced in [8] is an interesting alternative to theMP algorithm. The GMF algorithm yields the sparse representationthat minimizes the criterion:

min12ky� Axk2

2 þ hkxk1 with h > 0; ð2Þ

where kXk1 ¼Pjxjj and h > 0 is a threshold which controls the

sparseness of the representation. Indeed (2) can be rewritten as:

min kAxk22 subject to kATðy� AxÞk1 6 h;

where the constraint can be given the following physical interpre-tation. At a point, say x̂ satisfying it,

� y� Ax̂ is the residual vector which can be seen as the unex-plained part y,� ATðy� Ax̂Þ is a vector containing all the correlations of the

atoms with the residual vector,� the constraint guarantees that, at an admissible point x̂, no com-

ponent in this vector exceeds h.� one finally seeks the representation satisfying this constraint

that has minimal energy.

At the optimum all the used atoms have indeed their correlationequal to h. As opposed to MP which is an ad hoc procedure, GMF isoptimal. The advantage of GMF, compared to MP, is that for one va-lue of h, the best elements of A are simultaneously selected, insteadof choosing the atoms one by one.

The price to pay for this difference is a higher computationalburden, although there are now quite efficient ways to implementGMF. The criterion (2) has appeared somehow simultaneously indifferent communities and is sometimes known as Basis-Pursuitdenoising [9]. An efficient manner to implement the GMF, thatwe will be using in the sequel, has been developed in the statisticscommunity [10] (see also [11]). Although one wants to solve (2) fora fixed h, the algorithm works iteratively in the number of compo-nents (just as MP) and starts with h = kATyk1 for which a first nonzero component appears in the component of the optimal x. Thevalue of h is then decreased and the next value of h for which a sec-ond component becomes non zero is found. The algorithm pro-ceeds in this way until the desired value of h falls within thecurrent interval in h. One searches (adjacent) intervals of h withinwhich – the number of nonzero components in the optimum x re-mains constant and – an explicit expression of the optimum isknown. Using this quite efficient algorithm has indeed a furtheradvantage: it allows to build a sequence, say xk of optimal repre-sentations with increasing complexity. The basic MP algorithmproceeds in the same way but at step k, the xk it generates hasno real optimality property.

Remarks: In the sparse representation context, it is importantto note that larger the m, the number of components (atoms) inthe redundant basis (dictionary), the smaller the number of com-ponents required in a potential ‘‘good’’ representation, but alsothe higher the computational complexity. Hence it is essential tochoose the dictionary carefully or better to adapt it to the signalsto be represented. In the sequel we will essentially use the DiscreteCosine Transform basis and the Discrete Fourier Transform basis inthe real pixel-domain, i.e., the vector y is filled with pixels of theavailable blocks of the image under investigation.

3. Spatial prediction based on sparse representations

The use of sparse representation is first described for spatialtexture prediction. The basic considered framework is H.264/AVC[12]. As previously pointed out, intra prediction in H.264/AVC isbased on directional modes that propagate the signal along givendirections. Such solutions are not able to correctly handle textureswith more than one oriented pattern. The method proposed in thispaper aims at avoiding this limitation and enabling to handle com-plex textures.

3.1. Principle

Let y be a m-dimensional vector filled with neighboring obser-vations which n < m samples are non zero and m � n samples are

Page 3: Sparse representations for spatial prediction and texture refinement

Fig. 1. Example of causal neighborhood.

714 A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720

equal to zero. These null value data are the unknown pixels to pre-dict. The principle of the technique is to extrapolate the m � n un-known data based on the knowledge of the n non zero samples,thanks to basis functions in the dictionary A.

One builds the m �m matrix A in which each column is a 2Dvectorized basis function of dimension m. To get a redundant dic-tionary, called Ac, the m � n rows of A, corresponding to the m � nnull values, are masked. Ac is then a n �m-dimensional matrix. Theredundancy of Ac is directly dependent on the number of zero val-ues. Similarly, the same components in y are deleted to form the n-dimensional vector yc.

The goal is then the search of an approximation of yc, in terms ofbasis functions chosen in the redundant dictionary Ac. Hence thesparsest solution vector x⁄ is found thanks to algorithms such asthe MP or the GMF algorithm, the m-dimensional vector ye is gen-erated thanks to the dictionary A:

ye ¼ Ax�:

ye is a m-dimensional vector: the extrapolation of the n knownpixels gives an approximation of the previous m � n data initiallyequal to zero.

3.2. Causal neighborhood

Previously reconstructed pixels surrounding the current block Pto predict form the causal neighborhood as depicted in Fig. 1. In thecontext of the proposed spatial prediction based on sparse repre-sentations, the purpose is to form a prediction for complex texturessuch as bi-dimensional patterns so as to minimize the residual im-age energy. One needs to enlarge the causal neighborhood used byH.264/AVC intra prediction to entirely catch the pattern of the tex-ture to extrapolate. We used a causal neighborhood composed ofseveral adjacent blocks of pixels, colored in blue1 in Fig. 2. Theamount of known samples depends on the block size and the chosenscan order of each block inside a macroblock (16 � 16 pixels). Thedictionary Ac is a n �m matrix where n is the number of non-zerocomponents inside the neighborhood and m is the horizontal dimen-sion of the neighborhood composed of 3, 4 or 5 adjacent blocks. Theblock size N � N depends on the chosen type of prediction: 4 � 4,8 � 8 or/and 16 � 16.

3.3. Non causal stopping criterion

Chosen algorithms (MP or GMF) stop as soon as a criterion (1),(2) based on the approximation of local data in yc is fulfilled. Themajor point to highlight is that a good approximation of the causalneighborhood does not lead, most of the time, to the best predic-tion. Then it is not relevant for prediction to keep a stopping crite-rion based on causal neighborhood.

Therefore we apply to both MP and GMF a stopping criterionthat tends to fulfill this goal, i.e., that tends to minimize the recon-

1 For interpretation of color in Figs. 2, the reader is referred to the web version ofthis article.

struction error in the current block P. Let yp be the vector filledwith the N � N samples corresponding to the considered sourceblock P. Ap is a compacted dictionary extracted from A composedof the same m basis functions but with N � N rows which spatiallycorrespond to the P area. We implement both algorithms so thatthey generate a sequence of representations xk of increasing com-plexity and for each xk we compute the prediction error energy:

kyp � Apxkk2

and we should thus stop as soon as this prediction error whichgenerically starts decreasing, increases. But since there is no reasonthat a more complex representation cannot indeed yield a smallerprediction error, we actually proceed differently and consider atwo steps procedure.

First the MP and the GMF algorithms are run until the pre-spec-ified threshold is reached and the resulting xk sequences are stored.The values of the thresholds are fixed such that the final represen-tation has a quite large number of components, say kmax. In a sec-ond step one then selects the optimal representation as the onethat gives the smallest error energy on the area P to be predicted:

kopt ¼ mink2½1;kmax �

kyp � Apxkk22:

Since components of yp refer to source pixels, this is a non-causalprocedure. The optimal number of atoms kopt has to be transmittedto the decoder to compute the same prediction.

Nevertheless, having a reconstruction error that is minimal doesnot necessarily lead to the best rate/distortion compromise. Then,we also introduced a stopping criterion based on the minimizationof a lagrangian cost function. The aim is to find the best compro-mise in terms of rate, R, and distortion. Thus, we considered thetwo following criterions:

MSE based : kopt ¼ mink2½1;kmax �

kyp � Apxkk22; ð3Þ

Lagrangian based : kopt ¼ mink2½1;kmax �

kyp � Apxkk22 þ kR: ð4Þ

3.4. Non-stationary neighborhood

One theoretical hypothesis related to the technique is to assumethe local texture signal is quasi-stationary. If this hypothesis is ful-filled, the extrapolated textured could lead to a good prediction forthe current block. In case of non-stationary signal it is not relevantto keep all the observations of the chosen causal neighborhood.Pixels non correlated enough with the block to predict representparasite noise.

We suggest to split the causal neighborhood into blocks, repre-senting then as many new support areas for the extrapolationprocess. Fig. 3 presents different cases of non-stationary neighbor-hoods. Green blocks suggest the set of pixels supposed to be themost correlated with the current block. For each of these smallneighborhoods, we generate a prediction for the current block.The second step is to choose among all these predictions whichone is the best according to a rate/distortion criterion. Then theoptimal prediction is set on competition with other intra predic-tion modes. The index of the neighborhood leading to the bestrate/distortion performance is encoded and then transmitted tothe decoder.

Results about sparse representations based spatial predictionare presented in Section 6.1.

4. Sparse representation for inter-intra layer prediction

We have first addressed the spatial prediction using sparse rep-resentation. The proposed concept can be generalized in a scalable

Page 4: Sparse representations for spatial prediction and texture refinement

Fig. 2. Selected causal neighborhood composed of 3, 4 or 5 adjacent blocks.

A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720 715

video coding framework. Our goal is to outperform usual inter-layer prediction methods by properly and jointly exploiting spatialand inter-layer predictions using sparsity properties of the signal.After a short description of the SVC inter-layer prediction process,our developed approach is presented in the following sub-sections.

4.1. Spatial scalability in SVC

Scalable video coding scheme has been approved as an exten-sion of H.264/AVC standard since July 2007 [13]. In such a schemean unique bitstream can be decoded at different spatio-temporalresolutions. SVC provides a large degree of flexibility in terms ofscalability: temporal, spatial and quality scalability. We will focushere on improving the spatial scalability performance.

SVC standard bases its spatial prediction on dependencies be-tween different resolutions. Each current macroblock is predictedwith up-sampled lower resolution signal (e.g. base layer). To im-prove coding efficiency, three upsampling processes are combined:texture, residual and motion.

The upsampling process on luminance component, textureupsampling, is performed with one dimensional 4-taps FIR filters,horizontally and vertically, on intra based predicted blocks. Thechrominance components are upsampled with bilinear filters.The encoder also performs an intra prediction thanks to the direc-tional modes of H.264/AVC standard. The best prediction is thenselected. If the current block corresponds to an inter-coded macro-block in the base layer, the enhancement layer macroblock is inter-coded. The inter-layer residual prediction can also be employed forinter or intra coded blocks to improve scalable coding.

4.2. Spatial SVC prediction with MP

The main goal of our approach is to take advantage of two pre-dictions: intra and spatial inter-layer predictions. As showed inSection 3, the information of neighbor pixels previously recon-structed is, in most cases, sufficient to recollect the unknown cur-rent block. When the signal to predict corresponds to new patternsor appearing edges (cf. Section 3.4), the reconstruction is quiteimpossible. In SVC, the current picture at a lower resolution, thebase layer, is available. Components of the upsampled base layercan be used in addition to adjacent pixels in the current picture.These pixels represent a major source of information, especiallywhen the signal is not predictable with only neighbor pixels.

We modify the algorithm to take upsampled base layer pixelsinto account, as depicted in Fig. 4. Pixels surrounding the currentblock are still considered and we also insert pixels of the upsam-pled base layer in the input vector y. As supplementary informa-tion about the current block to predict is available, the algorithmis expected to have better performance. Indeed, high frequencies

of the texture in the casual neighborhood are reallocated to thebase layer signal. Since there is no zero components in y anymoreand to keep working with a redundant dictionary (as explained inSection 3.1, the redundancy of the dictionary depends on zerocomponents), we add basis functions to form a larger dictionary,called A2.

The performance of the proposed joint intra-inter-layer predic-tion is measured against SVC and is presented in Section 6.2.

5. Redundant dictionaries for prediction

As exposed in previous sections, our proposed MP algorithm se-lects the most correlated atoms using neighbor samples. The qual-ity of the extrapolated signal is then highly dependent on thenature of selected atoms. The quest of sparseness is achieved ifthe considered atom well suits the signal. Actually, one of the bigissues of sparse representations is to be able to determine a rele-vant set of basis functions to represent any kind of features inimages. The ideal dictionary is composed of smooth functions torecollect low frequencies, other functions more spatially locatedfor high frequencies, such as edges or contours. Many dictionarywith various types of atoms are proposed [14]: Gaussian functions,Gabor or anisotropic atoms, wavelets, curvelets, bandelets etc.

The difficulty lies on few questions: how many different func-tions do we have to consider for being sparse? How do we choosethe different types of functions inserted in the dictionary? Most ofthe time, a compromise between complexity and quality needs tobe found.

Two tracks have been explored to solve these issues: the firstidea is to virtually extend the dictionary size by controlling thephase of the input atoms; a second studied solution is to usemono-dimensional basis functions that actually enable to betterfit the signal phase and orientation variations.

5.1. Phase refinement

In this section, we address the problem of finding a non exhaus-tive solution to this question. We propose to work with a limiteddictionary and to virtually increase its redundancy by phasingatoms. The main idea is to best fit the input data by searchingthe appropriate spatial phase thanks to phase correlation [15].

This frequency domain approach estimates the relative transla-tive motion between two images. In the context of sparse predic-tion, our goal is to estimate the shift between the current signalobservation (at first step, input signal is the causal neighborhooditself and then at other steps, it corresponds to the residual signal)and the selected bi-dimensional basis function. The spatial shift of

Page 5: Sparse representations for spatial prediction and texture refinement

Fig. 3. Example of non-stationary causal neighborhoods: green arrows suggest the best block of the neighborhood to use in order to form the current block prediction. (Forinterpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Spatial prediction based sparse representations in a SVC like scheme.

716 A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720

the two signals is reflected as a phase difference in the Fourierdomain.

We first chose to insert the phase correlation process after thechoice of the best correlated atom by the MP. Note that it is an aposteriori treatment necessarily sub-optimal compared to in looprefinement. The first step is to determine the bi-dimensional shiftthat might be present between the input data and the selectedatom. Let Finput be the Fourier transform of the input signal and Fa-

tom the Fourier transform of the basis function. The cross-powerspectrum is defined as followed:

C ¼ FinputF�atom

FinputF�atom

�� �� ;with F�atom the complex conjugate of Fatom. The correlation c betweenthe two signals is obtained by calculating the inverse Fourier trans-form of C. Then, the bi-dimensional location of the peak in c isdetected:

fDx;Dyg ¼ arg maxx;yfcg:

The second step consists in generating a phased atom takinginto account the values of the spatial shift {Dx,Dy}. Since the theo-retical expressions of basis functions in the dictionary are known,there is no ambiguity to calculate the new shifted functions.

5.2. Mono-dimensional Cosine functions

Previous section deals with a phase refinement process appliedto basis functions to enhance the match between signal and atoms.In the same spirit, the following basis functions a1D are considered:

a1Dðx; kÞ ¼ffiffiffiffiffi2m

rcos xþ 1

2

� �pkm

� �: ð5Þ

These basis functions are bi-dimensional Cosine functions whereone frequency is set to zero as depicted in Fig. 5. In the sequel, theseatoms are called mono-dimensional functions. These basis functionspresent a large variety of angles and offer in addition the possibilityto form various spatial phases. Since algorithms, such as the MPalgorithm, form a sparse representation thanks to linear combina-tions, these atoms are expected to be relevant to represent bi-dimensional patterns of different spatial phase.

Section 6.3 further comments these two solutions on practicalexamples.

6. Experimental results

6.1. Results about intra prediction based on sparse representations

The proposed spatial prediction based on sparse representa-tions was integrated in KTA 1.2 (Key Technical Area) software[16] without any change of the encoder syntax. Indeed the pro-posed prediction mode substitutes for one H.264/AVC mode foreach type of prediction (the substituted mode having being chosenas the statistically less used AVC mode). We consider the spatialprediction of blocks of 4 � 4, 8 � 8, and 16 � 16 pixels. The Cosinefunctions have been used to construct the redundant dictionary A.The threshold is set to a value that yields a final representationhaving k, a quite large number, of non zero components. Thenthe vector x related to the optimal representation is selected, see(3) and (4). In all our simulations q and h are set to 1 in (1) and (2).

Simulations were performed on a large range (21–26–30–35) ofquantization levels to evaluate the Bjontegaard (BD) [17] averagePSNR improvement of luminance components and bitrate savings.Results present BD gains taking into account the overcost related toour non-causal stopping criterion (Section 3.3). Note that the over-

Page 6: Sparse representations for spatial prediction and texture refinement

Fig. 5. Some bi-dimensional Cosine Functions (left) and mono-dimensional atoms(right).

Table 1BD results for the 8 � 8 sparse prediction.

MP, Criterion (3) Without overcost With overcost

Gain (dB) Gain (%) Gain (dB) Gain (%)

Barbara +0.43 �5.61 +0.29 �3.75Pool +0.77 �8.38 +0.54 �5.94Wool +1.00 �12.75 +0.79 �10.06

Table 2BD results for different type of intra predictions and different algorithms.

Barbara, Criterion (3) MP GMF

Gain (dB) Gain (%) Gain (dB) Gain (%)

4 � 4 +0.72 �8.55 +0.62 �7.268 � 8 +0.43 �5.61 +0.44 �5.744 � 4, 8 � 8 and 16 � 16 +0.59 �7.60 +0.53 �6.91

A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720 717

cost is based on a simple entropy calculation. Further work has tobe done in order to reduce this overcost. Therefore we also presentBD gains without the value of this overcost to expose the potentialof our prediction. Fig. 6 presents the test images used for the sim-ulations. Table 1 shows BD gains on these test pictures with the MPalgorithm, Cosine functions in dictionary A and a MSE based crite-rion (3). The higher rate savings are obtained on the high texturedWool picture: about 10 % rate savings, taking into account the over-cost related to the non-causal stopping criterion (corresponding toan additional information sent to the decoder).

Table 2 presents results with the MP and GMF algorithm whendifferent types of prediction (4 � 4, 8 � 8, and 16 � 16) are com-bined. BD rate savings are similar for both algorithms. Althoughthe GMF algorithm leads to a sparser solution x than the MP algo-rithm, results show that the sparsest solution does not necessarilylead to the best prediction. Fig. 7 shows rate/distortion curves forthe H.264/avc spatial 8 � 8 prediction and the sparse representa-tions based method with the MP algorithm. Different solutionsare considered: with the EQM (3) or the Lagrangian (4) criterion,taking into account or not, the overcost induced by the non stop-ping criterion. This overcost is quite penalizing at low bit rates. Be-sides, since the reconstructed causal neighborhood is moredamaged at low bit rates, the signal reconstruction is less efficient.So that the sparse representations based technique seems dedi-cated to high rate coding.

To deal with the non-stationary issue, the local neighborhoodwas split (Section 3.4). As many predictions as the number ofblocks of the neighborhood (3,4 or 5 blocks, see Fig. 2) are compet-ing and the best of our prediction is chosen thanks to a rate/distor-tion Lagrangian criterion (4). The index of the best prediction hasto be coded. Then, this selected prediction substitutes one H.264/AVC mode. To be coherent, the rate/distortion mode decision ofKTA software is turned on. Results presented in Table 3 take intoaccount a supplementary cost related to the selection of the bestprediction based on a split neighborhood and the overcost relatedto the non-causal stopping criterion.

Fig. 6. Test pictures: left, Barbara (512 � 512); middl

Note that the computation of the whole set of scalar products isa quite computationally complex task which requires O((n �m)2)operations for each block. In order to lessen the complexity of com-puting scalar products, the use of a fast convolution algorithmbased on the Fourier transform could be considered [18]. The com-plexity of computing scalar products would be then substituted byO(n �mlog2(n �m)).

6.2. Results for the joint spatial inter-layer approach

This section presents results of the second proposed approach,that relates to spatial inter-layer MP based prediction, see Sec-tion 4.2. The algorithm runs both with the knowledge of surround-ing reconstructed pixels and lower resolution upsampledcomponents. In order to compare our method against the usual in-ter–intra layer SVC spatial prediction, our experiments have notbeen explicitly performed compared to a standard SVC codingscheme. Nevertheless a very similar scheme has been used as ref-erence to measure the performance of our proposed method. TheSVC scheme has been emulated using the following approach. Firstthe full resolution picture is downsampled to a dyadic resolutionusing a windowed cardinal sinus as a downsampling filter [19].This base layer picture is coded with the same quantization param-eter value as the enhanced layer. The coded base layer is thenupsampled using a Lanczos normalized filter [20].

For coding the enhancement layer, the less used AVC intra modeis replaced inter-layer prediction mode that uses the upsampledbase layer signal. This process is actually very close to the onespecified in the SVC standard, with a light difference regardingthe upsampling filter, and the removal of one AVC mode that isused here for the inter-layer prediction.

Regarding our solution, a similar process is applied. First theless used AVC mode (mode 6 [12]) is replaced by our new joint in-

e, Wool (720 � 576) and right, Pool (720 � 576).

Page 7: Sparse representations for spatial prediction and texture refinement

Fig. 7. Rate distortion curves for the 8 � 8 spatial MP based prediction on Barbara (512 � 512).

Table 3BD results with a split neighborhood.

Barbara, MP algorithm Without overcost With overcost (s)

Gain(dB)

Gain(%)

Gain(dB)

Gain(%)

No split neighborhood – Criterion(4)

+0.57 �7.38 +0.37 �4.84

Split neighborhood – Criterion (4) +0.77 �9.98 + 0.47 �6.08

Fig. 8. On the left: spatial inter-layer prediction of SVC; on the right, predictionbased on sparse representations where high frequencies are recovered. (The baselayer is encoded with the quantization parameter set to 15).

Table 4BD for the MP based spatial inter-layer prediction.

Barbara, MP algorithm, Criterion (3), A2 Gain (dB) Gain (%)

Without overcost +0.64 �8.58With overcost +0.24 �3.09

718 A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720

tra–inter-layer mode. We also choose to introduce the predictionfrom the upsampled base layer in place of the second statisticallyless chosen AVC mode (mode 8 [12]). Compared to the consideredreference method, the encoder runs with the MP prediction but thedifference is that non causal area surrounding the current block isfilled with pixels of the upsampled image of the base layer (Fig. 4).The redundant dictionary A2 is made up of Cosine functions and bi-dimensional basis functions a2D inspired from the Hartleytransform:

a2Dðx;y;k;lÞ¼ð�1Þxþy

ncosðpxcÞ

�sinðpycÞ cos2pnðxckcþyclcÞ

� �þsin

2pnðxckcþyclcÞ

� �� ;

where ⁄c = ⁄ � n/2. These directional functions are interesting to re-cover linear textures.

Fig. 8 shows the performance of our prediction competing withthe upsampled base layer. The Lanczos upsampling filter well suitsto recover contours between two different textured areas. How-ever, at a high quantization level, the texture is lost. Thereforethe MP algorithm offers an interesting alternative to spatiallyrecollect the signal. The objective performance is measured on Bar-bara picture, with same QPs as in previous section, see Table 4.When not considering the overcost of extra-information signaling,the BD gain is of 8.58 %. The gain reduces to 3.09% when extra-information signaling coding cost is considered. It is observed thatthe method performs better at high bitrates, when spatial textureof the high layer can really bring significant finer texture comparedto the base layer. At low bit-rates, a simple upsampling-based pre-diction as used in SVC appears to be sufficient.

6.3. Results about redundant dictionaries

6.3.1. Results about phase refinementConcerning the phase refinement tool, we first test the algo-

rithm outside the encoding loop. The dictionary is composed of Co-sine functions. To increase the reliability of the process, we use

sub-pixel accuracy of the correlation peak (based on a parabola fit-ting technique [21]) to find the spatial shift.

Page 8: Sparse representations for spatial prediction and texture refinement

Fig. 9. MP based prediction with a phase refinement process. (a) source image; MP based prediction without (b) and with (c) phase refinement process and respectiveresidual images (d) = (a) � (b) and (e) = (a) � (c).

Table 5BD results with mono-dimensional atoms a1D and a weighted function W.

Barbara Without W

Without overcost With overcost

MP – Criterion (3) – DCT 2D +0.43 �5.61 +0.29 �3.75MP – Criterion (3) – a1D +0.55 �7.17 +0.43 �5.60

Barbara With W

Without overcost With overcost

MP – Criterion (3) – DCT 2D +0.56 �7.27 +0.41 �5.33MP – Criterion (3) – a1D +0.73 �9.52 +0.59 �7.70

A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720 719

Fig. 9 shows the interest of using a phase refinement process.Residual frame (e), the difference between the source (a) and theMP based prediction with phase refinement (c) presents less coef-ficients than the residual image formed with the source image (a)and the MP based prediction without phase refinement (b). Preli-minary results with half-pel estimation, show an encouragingimprovement on prediction image. Further works will focus onshift determination. For instance, looking for the barycenter ofthe neighboring maximal energy peaks of the correlation peak.

6.3.2. Results about mono-dimensional Cosine functionsExperimental protocol is the same as the one presented in Sec-

tion 6.1. The dictionary A is filled with basis functions a1D (5) andthe MP algorithm is used. To deal with the non-stationary neigh-borhood issue, a weighted Gaussian function

wðx; yÞ ¼ Cffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiðx�m=2Þ2þðy�m=2Þ2p

, where C is a constant set to 0.88 andm is the size of the neighborhood, is used. w is diagonalized to formmatrix W which is applied to the input signal y and to the dictio-nary A:W � (y � Ax).

Table 5 presents BD performance on Barbara. These mono-dimensional atoms improve the quality of the coding process by1.85 % in terms of rate without the weighted function and by2.37 % with W. Linear textures of Barbara’s pants are better recov-ered and also are the bi-dimensional pattern of the chair in thebackground of the picture.

7. Conclusion

This paper presented a new approach of spatial predictionbased on sparse representations. Different applications of thisscheme were considered. Our new intra prediction offers interest-ing perspectives compared to directional modes of H.264/AVC.Regarding complex textures, the MP algorithm turns out to be aninteresting alternative for intra prediction and also for the SVC spa-tial inter layer prediction. Simulation results show bitrate savingsup to 10 % in H.264/AVC with the MP based prediction. First resultsshow that split neighborhood solve some issues related to non-sta-tionary textures. Further work could deal with the choice of neigh-borhood containing uniform textures. Such texturecharacterization may be a way to improve the basis functionsselection. Concerning the SVC application, the knowledge of noncausal components allows the MP algorithm to extrapolate localtextures more precisely. Thanks to the phase refinement process,we virtually increase the redundancy of the dictionary and then,avoid running the MP with a too large set of basis functions. Be-sides proposed mono-dimensional functions show interesting re-sults to represent bi-dimensional textures of different spatialphases. Taking into account the high complexity of the MP algo-rithm, compared to the intra prediction of H.264/AVC, we shouldexplore a trade-off between the coding efficiency and the complex-ity of the MP. Finally the dictionary choice is the major issue thatneeds to be solved to improve performance. Recent works[22,23] suggest the use of spatial dictionaries formed with thesource image. Basis functions are adaptively extracted from thisdictionary. One interest is to form atoms of different sizes and spa-tial phases. Such basis functions allow to better fit texturecharacteristics.

References

[1] R. Neff, A. Zakhor, Matching pursuit video coding at very low bit rates, in:Proceedings of Data Compression Conference, 1995, pp. 411–420.

[2] B. Wang, Y. Wang, P. Yin, A two pass h.264-based matching pursuit videocoder, in: Proceedings of the International Conference on Image Processing,ICIP, 2006, pp. 3149–3152.

Page 9: Sparse representations for spatial prediction and texture refinement

720 A. Martin et al. / J. Vis. Commun. Image R. 22 (2011) 712–720

[3] L. Granai, E. Maggio, L. Peotta, P. Vandergheynst, Hybrid video coding based onbidimensional matching pursuit, Journal on Applied Signal Processing (2004)2705–2714.

[4] U. Desai, Dct and wavelet based representations of arbitrarily shaped imagesegments, in: Proceedings of the IEEE International Conference on ImageProcessing, 1995, pp. 558–561.

[5] M. Elad, J.-L. Starck, P. Querre, D. Donoho, Simultaneous cartoon and textureimage in painting using morphological component analysis (mca), Applied andComputational Harmonic Analysis 19 (2005) 340–358.

[6] M. Aharon, M. Elad, A. Bruckstein, The k-svd: an algorithm for designing ofovercomplete dictionaries for sparse representation, IEEE Transactions onSignal Processing 54 (2006) 4311–4322.

[7] S. Mallat, Z. Zhang, Matching pursuits with time frequency dictionaries, IEEETransactions on Signal Processing 41 (1993) 3397–3415.

[8] J.-J. Fuchs, On the application of the global matched filter to doa estimationwith uniform circular arrays, IEEE Transactions on Signal Processing (2001)702–709.

[9] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAMJournal on Scientific Computing 20 (1998) 33–61.

[10] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals ofStatistics 32 (2004) 407–499.

[11] S. Maria, J.-J. Fuchs, Application of the global matched filter to stap data, anefficient algorithmic approach, in: Proceedings of ICASSP, 2006.

[12] T. Wiegand, G.J. Sullivan, G. Bjntegaard, A. Luthra, Overview of the h.264/avcvideo coding standard, IEEE Transactions on Circuits and Systems for VideoTechnology 13 (2003) 560–576.

[13] H. Schwarz, D. Marpe, T. Wiegand, Overview of the scalable video codingextension of the h.264/avc standard, IEEE Special Issue on SVC 17 (2007)1103–1120.

[14] R.F. i Ventura, P. Vandergheynst, P. Frossard, Low-rate and flexible imagecoding with redundant representations, IEEE Transactions on ImageProcessing 15 (2006) 726–739.

[15] C. Kuglin, D. Hines, The phase correlation image alignment method,Proceedings of IEEE – Cybernetics and Society (1975) 163–165.

[16] <http://iphome.hhi.de/suehring/tml/download/KTA/>.[17] G. Bjontegaard, Calculation of average psnr differences between rd curves, ITU-

T VCEG-M33, 2001.[18] R. Figueras i Ventura, O. Divorra Escoda, P. Vandergheynst, A matching pursuit

full search algorithm for image approximations, Technical report, EPFL, 2004.[19] S. Sun, J. Reichel, Ahg report on spatial scalability resampling, ISO/IEC MPEG

and ITU-T VCEG – Joint Video Team, Doc. JVT-R006, 2006.[20] A. Segall, J. Zhao, Ce4: Evaluation of texture upsampling with 4-tap cubic-

spline filter, ISO/IEC MPEG and ITU-T VCEG – Joint Video Team, Doc. JVT-U042,2006.

[21] Q. Tian, M. Huhns, Algorithms for subpixel registration, Computer Vision,Graphics and Image Processing 35 (1986) 220–233.

[22] M. Aharon, M. Elad, Sparse and redundant modeling of image content using animage-signature-dictionary, SIAM Journal on Imaging Sciences (2007).

[23] J. Zepeda, C. Guillemot, E. Kijak, The iteration-tuned dictionary for sparserepresentations, in: Proceedings of the IEEE International Workshop on MMSP,2010.