
Deep Convolutional AutoEncoder-based Lossy Image Compression

Zhengxue Cheng∗, Heming Sun, Masaru Takeuchi∗, and Jiro Katto∗

∗Graduate School of Fundamental Science and Engineering, Waseda University, Tokyo, Japan

Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Image compression has been investigated as a fundamental research topic for many decades. Recently, deep learning has achieved great success in many computer vision tasks, and is gradually being used in image compression. In this paper, we present a lossy image compression architecture, which utilizes the advantages of convolutional autoencoder (CAE) to achieve a high coding efficiency. First, we design a novel CAE architecture to replace the conventional transforms and train this CAE using a rate-distortion loss function. Second, to generate a more energy-compact representation, we utilize the principal components analysis (PCA) to rotate the feature maps produced by the CAE, and then apply the quantization and entropy coder to generate the codes. Experimental results demonstrate that our method outperforms traditional image coding algorithms, by achieving a 13.7% BD-rate decrement on the Kodak database images compared to JPEG2000. Besides, our method maintains a moderate complexity similar to JPEG2000.

I. INTRODUCTION

Image compression has been a fundamental and significant research topic in the field of image processing for several decades. Traditional image compression algorithms, such as JPEG [1] and JPEG2000 [2], rely on the hand-crafted encoder/decoder (codec) block diagram. They use fixed transform matrices, i.e., the discrete cosine transform (DCT) and the wavelet transform, together with quantization and an entropy coder to compress the image. However, they are not expected to be an optimal and flexible image coding solution for all types of image content and image formats.

Deep learning has been successfully applied in various computer vision tasks and has the potential to enhance the performance of image compression. In particular, the autoencoder has been applied in dimensionality reduction, compact representations of images, and generative model learning [3]. Thus, autoencoders are able to extract more compressed codes from images with a minimized loss function, and are expected to achieve better compression performance than existing image compression standards including JPEG and JPEG2000. Another advantage of deep learning is that, although the development and standardization of a conventional codec has historically taken years, a deep learning based image compression approach can be adapted much more quickly to new media contents and new media formats, such as 360-degree images and virtual reality (VR) [4]. Therefore, deep learning based image compression is expected to be more general and more efficient.

Recently, some approaches have been proposed to take advantage of the autoencoder for image compression. Due to the inherent non-differentiability of round-based quantization, a quantizer cannot be directly incorporated into autoencoder optimization. Thus, the works [4] and [5] proposed a differentiable approximation for quantization and entropy rate estimation for an end-to-end training with gradient backpropagation. Unlike those works, the work [6] used an LSTM recurrent network for compressing small thumbnail images (32 × 32), and used a binarization layer to replace the quantization and entropy coder. This approach was further extended in [7] for compressing full-resolution images. These works achieved promising coding performance; however, there is still room for improvement, because they did not analyze the energy compaction property of the generated feature maps and did not use a real entropy coder to generate the final codes.

In this paper, we propose a convolutional autoencoder (CAE) based lossy image compression architecture. Our main contributions are twofold.

1) To replace the transform and inverse transform in traditional codecs, we design a symmetric CAE structure with multiple downsampling and upsampling units to generate feature maps with low dimensions. We optimize this CAE using an approximated rate-distortion loss function.

2) To generate a more energy-compact representation, we propose a principal components analysis (PCA)-based rotation to generate more zeros in the feature maps. Then, the quantization and entropy coder are utilized to compress the data further.

Experimental results demonstrate that our method outperforms JPEG and JPEG2000 in terms of PSNR, and achieves a 13.7% BD-rate decrement compared to JPEG2000 with the popular Kodak database images. In addition, our method is computationally more appealing compared to other autoencoder based image compression methods.

The rest of this paper is organized as follows. Section II presents the proposed CAE based image compression architecture, which includes the design of the CAE network architecture, quantization, and entropy coder. Section III summarizes the experimental results and compares the rate-distortion (RD) curves of the proposed CAE with those of existing codecs. Conclusion and future work are given in Section IV.

II. PROPOSED CONVOLUTIONAL AUTOENCODER BASED IMAGE COMPRESSION

The block diagram of the proposed image compression based on CAE is illustrated in Fig. 1. The encoder part includes the pre-processing steps, CAE computation, PCA rotation, quantization, and entropy coder. The decoder part mirrors the architecture of the encoder.

arXiv:1804.09535v1 [cs.CV] 25 Apr 2018

Fig. 1: Block diagram of the proposed CAE based image compression. (The detailed block for downsampling/upsampling is shown in Fig. 2)

To build an effective codec for image compression, we train this approach in two stages. First, a symmetric CAE network is designed using convolution and deconvolution filters. Then, we train this CAE greedily using an RD loss function with an added uniform noise, which is used to imitate the quantization noise during the optimization process. Second, by analyzing the produced feature maps from the pre-trained CAE, we utilize the PCA rotation to produce more zeros for improving the coding efficiency further. Subsequently, quantization and entropy coder are used to compress the rotated feature maps and the side information for PCA (matrix U) to generate the compressed bitstream. Each of these components will be discussed in detail in the following.

A. CAE Network

As the pre-processing steps before the CAE design, the raw RGB image is mapped to YCbCr images and normalized to [0, 1]. For general purposes, we design the CAE for each luma or chroma component; therefore, the CAE network handles inputs of size H × W × 1. When the size of the raw image is larger than H × W, the image will be split into non-overlapping H × W patches, which can be compressed independently.
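The patch splitting described above can be sketched as follows. This is a minimal illustration, assuming a single channel already normalized to [0, 1] and image dimensions that are exact multiples of the patch size; the function name `split_into_patches` is hypothetical and not taken from the paper.

```python
import numpy as np

def split_into_patches(channel, patch_h=128, patch_w=128):
    """Split a 2-D channel into non-overlapping patch_h x patch_w patches.

    Assumption: the channel size is an exact multiple of the patch size
    (true for 768x512 images with 128x128 patches).
    """
    h, w = channel.shape
    if h % patch_h or w % patch_w:
        raise ValueError("channel size must be a multiple of the patch size")
    rows, cols = h // patch_h, w // patch_w
    # (rows, patch_h, cols, patch_w) -> (rows, cols, patch_h, patch_w)
    blocks = channel.reshape(rows, patch_h, cols, patch_w).swapaxes(1, 2)
    # Flatten the patch grid in raster order and add the channel axis.
    return blocks.reshape(rows * cols, patch_h, patch_w, 1)

# Stand-in luma channel with values in [0, 1] (random data, not a real image).
luma = np.random.default_rng(0).random((512, 768))
patches = split_into_patches(luma)
print(patches.shape)  # (24, 128, 128, 1)
```

Each patch has the H × W × 1 shape that the CAE expects, so the patches can be fed to the network independently.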

The CAE network can be regarded as an analysis transform with the encoder function, $y = f_\theta(x)$, and a synthesis transform with the decoder function, $\hat{x} = g_\phi(y)$, where $x$, $\hat{x}$, and $y$ are the original images, reconstructed images, and the compressed data, respectively. θ and φ are the optimized parameters in the encoder and decoder, respectively.

To obtain the compressed representation of the input images, downsampling/upsampling operations are required in the encoding/decoding process of CAE. However, consecutive downsampling operations will reduce the quality of the reconstructed images. The work [4] points out that super-resolution is achieved more efficiently by first convolving images and then upsampling them. Therefore, we propose a pair of convolution/deconvolution filters for upsampling or downsampling, as shown in Fig. 2, where $N_i$ denotes the number of filters in the convolution or deconvolution block. By setting the stride as 2, we can get downsampled feature maps. The padding size is set as one to maintain the same size as the input. Unlike the work [4], we do not use residual networks and sub-pixel convolutions; instead, we apply deconvolution filters to achieve a symmetric and simple CAE network.

Fig. 2: Downsampling/Upsampling Units with two (De)Convolution Filters.

In traditional codecs, the quantization is usually implemented using the round function (denoted as [·]), and the derivative of the round function is zero almost everywhere and undefined at the rounding boundaries. Due to the non-differentiable property of the rounding function, the quantizer cannot be directly incorporated into the gradient-based optimization process of CAE. Thus, some smooth approximations are proposed in related works. Theis et al. [4] proposed to replace the derivative in the backward pass of back propagation as $\frac{d}{dy}([y]) \approx 1$. Balle et al. [5] replaced the quantization by an additive uniform noise as [y] ≈ y + µ. Toderici et al. [6] used a stochastic binarization function as b(y) = −1 when y < 0, and b(y) = 1 otherwise. In our method, we use simple uniform noise to imitate the quantization noise during the CAE training. After CAE training, we apply the real round-based quantization in the final image compression. The network architecture of CAE is shown in Fig. 1, in which $N_i$ denotes the number of filters in each convolution layer and determines the number of generated feature maps.
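To see why additive uniform noise is a reasonable stand-in for rounding, the following sketch compares the two numerically. It assumes a unit quantization step and Gaussian stand-in data, both chosen only for illustration; the noise range actually used for training is the one stated in Section III-A.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(scale=4.0, size=100_000)   # stand-in for encoder outputs

# Real quantization, applied after training: round to the nearest integer.
round_error = np.round(y) - y

# Training-time surrogate: additive uniform noise, [y] ~ y + mu.
mu = rng.uniform(-0.5, 0.5, size=y.shape)
y_noisy = y + mu

# Both perturbations are bounded by half a step and have similar variance
# (close to 1/12 for a unit step).
print(np.abs(round_error).max(), np.abs(mu).max())
print(round_error.var(), mu.var())
```

Unlike `np.round`, the mapping from `y` to `y_noisy` has a derivative of one everywhere, so gradients can pass through it during training.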

As for the activation function in each convolution layer, we utilize the Parametric Rectified Linear Unit (PReLU) function [8], instead of the ReLU which is commonly used in the related works. The performance with ReLU and PReLU functions is shown in Fig. 3. Compared to ReLU, PReLU can improve the quality of the reconstructed images, especially for high bit rates.

Fig. 3: The effect of activation function in CAE.

Inspired by the rate-distortion cost function in the traditional codecs, the loss function of CAE is defined as

$$J(\theta, \phi; x) = \|x - \hat{x}\|^2 + \lambda \cdot \|y\|^2 = \|x - g_\phi(f_\theta(x) + \mu)\|^2 + \lambda \cdot \|f_\theta(x)\|^2 \tag{1}$$

where $\|x - \hat{x}\|^2$ denotes the mean square error (MSE) distortion between the original images $x$ and reconstructed images $\hat{x}$. µ is the uniform noise. λ controls the tradeoff between the rate and distortion. $\|f_\theta(x)\|^2$ denotes the amplitude of the compressed data y, which reflects the number of bits used to encode the compressed data. In this work, the CAE model was optimized using Adam [9], and was applied to images with the size of H × W. We used a batch size of 16 and trained the model up to $8 \times 10^5$ iterations, but the model reached convergence much earlier. The learning rate was kept at a fixed value of 0.0001, and the momentum was set as 0.9 during the training process.
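The loss in Eq. (1) can be written out as a short function. This is a sketch under stated assumptions, not the authors' training code: the encoder and decoder are toy linear maps standing in for the CAE, the rate term is averaged over elements (the paper does not specify the normalization), and all function names are hypothetical.

```python
import numpy as np

def rd_loss(x, encode, decode, lam, noise_half_width, rng):
    """Eq. (1): ||x - g(f(x) + mu)||^2 + lam * ||f(x)||^2."""
    y = encode(x)
    # Uniform noise imitates quantization during training.
    mu = rng.uniform(-noise_half_width, noise_half_width, size=y.shape)
    x_hat = decode(y + mu)
    distortion = np.mean((x - x_hat) ** 2)   # MSE distortion term
    rate_proxy = np.mean(y ** 2)             # amplitude of the code (assumed mean)
    return distortion + lam * rate_proxy

# Toy stand-ins for f_theta and g_phi; the real ones are convolutional networks.
def toy_encode(x):
    return 0.5 * x

def toy_decode(y):
    return 2.0 * y

rng = np.random.default_rng(0)
x = rng.random((16, 128, 128, 1))            # one batch of 16 patches
loss = rd_loss(x, toy_encode, toy_decode, lam=1.0,
               noise_half_width=2.0 ** -10, rng=rng)
rate_only = np.mean(toy_encode(x) ** 2)
print(loss, rate_only)
```

Because the toy decoder inverts the toy encoder exactly, the only distortion comes from the injected noise, which makes the two terms of the loss easy to tell apart.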

B. PCA Rotation, Quantization, and Entropy Coder

After the CAE computation, an image representation with a size of $\frac{H}{8} \times \frac{W}{8} \times N_6$ is obtained for each H × W × 1 input, where $N_6$ denotes the number of filters in the sixth convolution layer of the encoder part. Three examples of the feature maps for the 512 × 512 images cropped from the Kodak database [11] are shown in the second column of Fig. 4. It can be observed that each feature map can be regarded as one high-level representation of the raw images.
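The spatial size of the representation follows from standard strided-convolution arithmetic. The sketch below assumes a 3 × 3 kernel (the kernel size is not stated in the text) together with the stride of 2 and padding of 1 given above, and three downsampling stages, which is what a factor of 8 implies.

```python
def conv_output_size(n, kernel=3, stride=2, padding=1):
    """Spatial output size of a strided convolution (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1

size = 128
for _ in range(3):            # three stride-2 stages give a factor of 8
    size = conv_output_size(size)
print(size)  # 16

n6 = 32                       # number of filters in the sixth layer
values_in = 128 * 128 * 1     # input samples per patch
values_out = size * size * n6 # code values per patch before quantization
print(values_in, values_out)  # 16384 8192
```

With these settings a 128 × 128 patch maps to a 16 × 16 × 32 representation, so the bit savings have to come from the PCA rotation, quantization, and entropy coding rather than from the change in the number of values alone.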

To obtain a more energy-compact representation, we decorrelate each feature map by utilizing the principal component analysis (PCA), because PCA is an unsupervised dimensionality reduction algorithm and is suitable for learning the reduced features as a supplement to the CAE. The generated feature maps are denoted as y with a size of $\frac{H}{8} \times \frac{W}{8} \times N_6$, and y is reshaped as $N_6$-dimensional data. PCA is performed using the following steps. The first step is to compute the covariance matrix of y as follows:

$$\Sigma = \frac{1}{m} \sum_{1}^{m} (y)(y)^T \tag{2}$$

where m is the number of samples for y. The second step is to compute the eigenvectors of Σ and stack the eigenvectors in columns to form the matrix U. Here, the first column is the principal eigenvector corresponding to the largest eigenvalue, the second column is the second eigenvector, and so on.

Fig. 4: Examples of three images and their corresponding feature maps arranged in raster-scan order ($N_6 = 32$): (a)(d)(g) Raw images, (b)(e)(h) Generated 32 feature maps for Y-component by CAE, and the size of each feature map is $\frac{H}{8} \times \frac{W}{8}$, (c)(f)(i) Rotated Y feature maps by PCA, arranged in vertical scan order.

The third step is to rotate the $N_6$-dimensional data y by computing

$$y_{rot} = U^T y \tag{3}$$

By computing $y_{rot}$, we can ensure that the first feature maps have the largest values, and the feature maps are sorted in descending order. Experimental results demonstrate that the vertical-scan order for the feature maps works a little better than diagonal scan and horizontal scan; therefore, we arrange the feature maps in vertical scan as shown in the third column of Fig. 4. It can be observed that more zeros are generated in the bottom-right corner and large values are concentrated in the top-left corner in the rotated feature maps, which helps the entropy coder achieve a large compression ratio.
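The three PCA steps can be sketched in NumPy as follows. This is an illustration under assumptions: each spatial position is treated as one $N_6$-dimensional sample (so m is the number of positions), the mean is not subtracted because Eq. (2) does not subtract it, and the feature data are synthetic. The name `pca_rotate` is hypothetical.

```python
import numpy as np

def pca_rotate(features):
    """Rotate an (H/8, W/8, N) feature tensor with PCA. Returns (rotated, U)."""
    h, w, n = features.shape
    y = features.reshape(-1, n).T              # N x m, one column per position
    m = y.shape[1]
    sigma = (y @ y.T) / m                      # Eq. (2)
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    U = eigvecs[:, order]                      # eigenvectors stacked in columns
    y_rot = U.T @ y                            # Eq. (3)
    return y_rot.T.reshape(h, w, n), U

# Synthetic, strongly correlated feature maps (not real CAE outputs).
rng = np.random.default_rng(0)
base = rng.normal(size=(16, 16, 4))
mix = rng.normal(size=(4, 32))
features = base @ mix + 0.01 * rng.normal(size=(16, 16, 32))

rotated, U = pca_rotate(features)
energy = np.mean(rotated.reshape(-1, 32) ** 2, axis=0)
print(energy[:6])
```

The per-map energy of the rotated data equals the sorted eigenvalues of Σ, which is why the first maps carry the largest values and the later maps are close to zero.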

After the PCA rotation, the quantization is performed as

$$y' = [2^{B-1} \cdot y_{rot}] \tag{4}$$

where B denotes the number of bits for the desired precision, which is set as 12 in our model.
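As a small worked example of Eq. (4) with B = 12, the sketch below quantizes a few hypothetical rotated values. The input numbers are made up for illustration.

```python
import numpy as np

B = 12
scale = 2 ** (B - 1)                       # 2048

# Hypothetical rotated feature values.
y_rot = np.array([-0.75, -0.0001, 0.0, 0.0003, 0.5, 0.999])

# Eq. (4): scale, then round to the nearest integer.
y_q = np.round(scale * y_rot).astype(np.int64)
print(y_q.tolist())  # [-1536, 0, 0, 1, 1024, 2046]
```

Values whose magnitude is below 1/4096 are mapped to zero, which is how the small entries produced by the PCA rotation turn into runs of zeros for the entropy coder.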

As for the entropy coder, we use the JPEG2000 entropy coder to decompose y′ into bitplanes and apply the adaptive binary arithmetic coder. It is noted that the JPEG2000 entropy coder applies the EBCOT (embedded block coding with optimized truncation) algorithm to achieve a desired rate R, which is also referred to as post-compression RD optimization. In our method, the feature maps rotated by PCA have many zeros; therefore, assigning the target bits R can further improve the coding efficiency.

In the decoder part, de-quantization is performed as

$$\hat{y}_{rot} = \frac{y'}{2^{B-1}} \tag{5}$$

After obtaining the floating-point data $\hat{y}_{rot}$ from the bitstream, we recover the feature maps from the rotated data by using

$$\hat{y} = U \hat{y}_{rot} \tag{6}$$

Then, the CAE decoder network will reconstruct the images using $\hat{x} = g_\phi(\hat{y})$. The side information of the PCA rotation is the matrix U with a dimension of $N_6 \times N_6$ for each image. We also quantize U and encode it. The bits for U are added to the final rate as the side information in the experimental results.
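The decoder-side steps in Eqs. (5) and (6) can be checked with a round trip. The sketch assumes synthetic feature data in [-1, 1] and leaves out the entropy coder and the quantization of U; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, B = 32, 256, 12
y = rng.uniform(-1.0, 1.0, size=(n, m))     # stand-in feature data, N x m

# Orthonormal U: eigenvectors of the covariance, largest eigenvalue first.
_, eigvecs = np.linalg.eigh((y @ y.T) / m)
U = eigvecs[:, ::-1]

scale = 2 ** (B - 1)
y_q = np.round(scale * (U.T @ y))           # encoder: Eqs. (3) and (4)
y_rot_hat = y_q / scale                     # decoder: Eq. (5)
y_hat = U @ y_rot_hat                       # decoder: Eq. (6)

max_err = np.abs(y_hat - y).max()
print(max_err)
```

Because U is orthonormal, multiplying by U undoes the rotation by its transpose, and the only remaining error is the rounding error of at most $2^{-B}$ per rotated coefficient.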

III. EXPERIMENTAL RESULTS

A. Experimental Setup

We use a subset of the ImageNet database [10] consisting of 5500 images to train the CAE network. In our experiments, H and W are set as 128; therefore, the images that are input to the CAE are split into 128 × 128 patches. The numbers of filters, i.e., $N_i$, i ∈ [1, 6], in the convolutional layers are set as {32, 32, 64, 64, 64, 32}, respectively. The decoder part mirrors the encoder part. The luma component is used to train the CAE network. Mean square error is used in the loss function during the training process in order to measure the distortion between the reconstructed images and original images. For testing, we use the commonly used Kodak lossless image database [11] with 24 uncompressed 768 × 512 or 512 × 768 images. In our CAE training process, λ is set as one and the uniform noise µ is drawn from $[-\frac{1}{2^{10}}, \frac{1}{2^{10}}]$.

In order to measure the coding efficiency of the proposed CAE-based image compression method, the rate is measured in terms of bits per pixel (bpp). The quality of the reconstructed images is measured using the quality metrics PSNR and MS-SSIM [12], which measure the objective quality and perceived quality, respectively.

B. Coding Efficiency Performance

We compare our CAE-based image compression with JPEG and JPEG2000. The color space in this experiment is YUV444. Since the human visual system is more sensitive to the luma component than chroma components, it is common to assign the weights $\frac{6}{8}$, $\frac{1}{8}$, and $\frac{1}{8}$ to the Y, Cb, and Cr components, respectively. The RD curves for the images red door and a girl are shown in Fig. 5. The coding efficiency of CAE is better than those of both JPEG2000 and JPEG in terms of PSNR. In terms of MS-SSIM, CAE is better than JPEG and comparable with JPEG2000, because optimizing MSE in CAE training leads to better PSNR characteristics, but not MS-SSIM. Besides, CAE handles a fixed input size of 128 × 128; therefore, block boundary artifacts appear in some images. It is expected that adding perceptual quality metrics into the loss function will improve the MS-SSIM performance, which will be carried out in our future work. Examples of reconstructed patches are shown in Fig. 6. We can observe that the subjective quality of the reconstructed images for CAE is better than that of JPEG and comparable with that of JPEG2000.
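One way to apply the 6/8, 1/8, 1/8 weights is a weighted average of the per-channel PSNR values, sketched below. This interpretation is an assumption, since the text does not say whether the weights apply to PSNR or to MSE; the function names and test data are hypothetical, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np

def psnr(ref, rec, peak=1.0):
    """PSNR in dB for signals with the given peak value."""
    mse = np.mean((ref - rec) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def weighted_psnr(ref_ycbcr, rec_ycbcr, weights=(6 / 8, 1 / 8, 1 / 8)):
    """Weighted average of per-channel PSNR over Y, Cb, Cr (assumed form)."""
    return sum(w * psnr(ref_ycbcr[..., c], rec_ycbcr[..., c])
               for c, w in enumerate(weights))

# Synthetic example: constant offsets give known per-channel PSNR values.
rng = np.random.default_rng(0)
ref = rng.random((64, 64, 3))
rec = ref.copy()
rec[..., 0] += 0.01    # Y:  MSE 1e-4 -> 40 dB
rec[..., 1] += 0.1     # Cb: MSE 1e-2 -> 20 dB
rec[..., 2] += 0.1     # Cr: MSE 1e-2 -> 20 dB
score = weighted_psnr(ref, rec)
print(score)
```

With these offsets the weighted score is 0.75 × 40 + 0.125 × 20 + 0.125 × 20 = 35 dB, which shows how strongly the luma channel dominates the result.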

Fig. 5: RD curves of color images for the proposed CAE, JPEG, and JPEG2000.

[Fig. 6 panels, one row per image: (a) Raw, (b) JPEG, (c) JPEG2000, (d) CAE. Row 1: 24 bpp, 0.290 bpp, 0.297 bpp, 0.293 bpp. Row 2: 24 bpp, 0.283 bpp, 0.300 bpp, 0.294 bpp. Row 3: 24 bpp, 0.318 bpp, 0.299 bpp, 0.295 bpp.]

Fig. 6: Examples of raw image (a) and reconstructed images (300 × 300) cropped from Kodak images using (b) JPEG, (c) JPEG2000 and (d) CAE.

The rate-distortion performance can be evaluated quantitatively in terms of the average coding efficiency differences, BD-rate (%) [13]. While calculating the BD-rate, the rate is varied from 0.12 bpp to 2.4 bpp and the quality is evaluated by using PSNR. With JPEG2000 as the benchmark, the BD-rate results for 24 images in the Kodak database are listed in Fig. 7. On average, for the 24 images in the Kodak database, our method achieves 13.7% BD-rate saving compared to JPEG2000.
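For readers unfamiliar with the BD-rate metric, the commonly used computation can be sketched as follows: fit a cubic polynomial to log-rate as a function of PSNR for each codec, integrate both fits over the overlapping PSNR range, and convert the average log-rate difference to a percentage. This is a generic sketch of that procedure, not the authors' evaluation script, and the RD points below are hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average rate difference (%) of the test codec relative to the anchor."""
    log_ra = np.log10(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log10(np.asarray(rate_test, dtype=float))
    # Cubic fit of log-rate as a function of PSNR for each codec.
    fit_a = np.polyfit(psnr_anchor, log_ra, 3)
    fit_t = np.polyfit(psnr_test, log_rt, 3)
    # Integrate over the PSNR interval covered by both curves.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Hypothetical RD points: the test codec needs 10% fewer bits at every PSNR.
psnr_points = [30.0, 33.0, 36.0, 39.0]
anchor_rates = np.array([0.2, 0.4, 0.8, 1.6])
test_rates = 0.9 * anchor_rates
result = bd_rate(anchor_rates, psnr_points, test_rates, psnr_points)
print(result)
```

A negative value means the test codec needs fewer bits than the anchor at equal PSNR, so a 13.7% BD-rate saving corresponds to a value of about -13.7 under this sign convention.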

Fig. 7: BD-rate of the proposed CAE with JPEG2000 as the benchmark.

Fig. 8: RD curves of gray images for our proposed CAE and Balle’s work.

We also compare our proposed CAE-based method with Balle’s work, which released the source code for gray images [5]. For a fair comparison, we give the comparison results for gray images. For Balle’s work, the rate is estimated by the entropy of the discrete probability distribution of the quantized vector, which is the lower bound of the rate. In our work, the rate is calculated by the real file size (kb) divided by the resolution of the tested images. Two examples of RD curves are shown in Fig. 8. Our method exhibits better RD curves than Balle’s work for some test images, such as Fig. 8(a), but exhibits slightly worse RD performance for some images, such as Fig. 8(b). On average, the performance of our proposed CAE method is comparable with Balle’s work, even though the CAE used an actual entropy coder against the ideal entropy of Balle’s work.
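The file-size-based rate measurement amounts to the following arithmetic. The sketch assumes the file size is given in bytes; the function name and the example file size are hypothetical.

```python
def bits_per_pixel(file_size_bytes, width, height):
    """Rate in bpp: total bits in the compressed file divided by pixel count."""
    return 8.0 * file_size_bytes / (width * height)

# Hypothetical 12,288-byte bitstream for a 768x512 image.
print(bits_per_pixel(12288, 768, 512))            # 0.25

# An uncompressed 8-bit RGB image uses 3 bytes per pixel.
print(bits_per_pixel(768 * 512 * 3, 768, 512))    # 24.0
```

Measuring the rate this way includes every overhead in the file, whereas an entropy estimate only gives a lower bound on the achievable rate.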

C. Complexity Performance

Our experiments are performed on a PC with a 4.20 GHz Intel Core i7-7700K CPU, 16 GB RAM, and a GeForce GTX 1080 GPU. The pre-processing steps for the images and Balle’s codec [5] are implemented as Matlab scripts in the Matlab R2016b environment. The JPEG and JPEG2000 codecs are taken from [14] and [15], and run on the CPU. Balle released only a CPU implementation. Running time refers to one complete encoder and decoder process for one color image with a resolution of 768 × 512, while Balle’s time refers to the gray image. The average running time per image for the different image compression methods is listed in Table I. It can be observed that our CAE-based method achieves lower complexity than Balle’s method [5] when it is run on the CPU, because we have designed a relatively simple CAE architecture. Besides, with a GPU implementation, our method achieves complexity comparable with those of JPEG and JPEG2000, which are implemented in C. This indicates that our method has relatively low complexity.

TABLE I: Average running time comparison.

Codec                        Time (s)
JPEG                         0.39
JPEG2000                     0.59
Balle’s work [5] with CPU    7.39
Proposed CAE with CPU        2.29
Proposed CAE with GPU        0.67

IV. CONCLUSION AND FUTURE WORK

In this paper, we proposed a convolutional autoencoder based image compression architecture. First, a symmetric CAE architecture with multiple downsampling and upsampling units was designed to replace the conventional transforms. Then this CAE was trained by using an approximated rate-distortion function to achieve high coding efficiency. Second, we applied the PCA to the feature maps for a more energy-compact representation, which can benefit the quantization and entropy coder to improve the coding efficiency further. Experimental results demonstrate that our method outperforms conventional image coding algorithms and achieves a 13.7% BD-rate decrement compared to JPEG2000 on the Kodak database images. In our future work, we will add perceptual quality metrics, such as MS-SSIM or the quality predicted by neural networks in [16], into the loss function to improve the MS-SSIM performance. Besides, the generative adversarial network (GAN) shows more promising performance than using an autoencoder only; therefore, we will utilize GAN to improve the coding efficiency further.

REFERENCES

[1] G. K. Wallace, “The JPEG still picture compression standard”, IEEE Trans. on Consumer Electronics, vol. 38, no. 1, pp. 43-59, Feb. 1991.

[2] Majid Rabbani, Rajan Joshi, “An overview of the JPEG2000 still image compression standard”, ELSEVIER Signal Processing: Image Communication, vol. 17, no. 1, pp. 3-48, Jan. 2002.

[3] P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders”, Intl. Conf. on Machine Learning (ICML), pp. 1096-1103, July 5-9, 2008.

[4] Lucas Theis, Wenzhe Shi, Andrew Cunningham and Ferenc Huszar, “Lossy Image Compression with Compressive Autoencoders”, Intl. Conf. on Learning Representations (ICLR), pp. 1-19, April 24-26, 2017.

[5] Johannes Balle, Valero Laparra, Eero P. Simoncelli, “End-to-End Optimized Image Compression”, Intl. Conf. on Learning Representations (ICLR), pp. 1-27, April 24-26, 2017. http://www.cns.nyu.edu/~lcv/iclr2017/

[6] G. Toderici, S. M. O’Malley, S. J. Hwang, et al., “Variable rate image compression with recurrent neural networks”, arXiv:1511.06085, 2015.

[7] G. Toderici, D. Vincent, N. Johnson, et al., “Full Resolution Image Compression with Recurrent Neural Networks”, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, July 21-26, 2017.

[8] K. He, X. Zhang, S. Ren and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, IEEE Intl. Conf. on Computer Vision (ICCV), pp. 1026-1034, Santiago, 2015.

[9] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, arXiv:1412.6980, pp. 1-15, Dec. 2014.

[10] J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1-8, June 20-25, 2009.

[11] Kodak Lossless True Color Image Suite, download from http://r0k.us/graphics/kodak/

[12] Z. Wang, E. P. Simoncelli and A. C. Bovik, “Multiscale structural similarity for image quality assessment”, The 36-th Asilomar Conference on Signals, Systems and Computers, Vol. 2, pp. 1398-1402, Nov. 2013.

[13] G. Bjontegaard, “Calculation of Average PSNR Differences between RD-curves”, ITU-T VCEG, Document VCEG-M33, Apr. 2001.

[14] JPEG official software libjpeg, https://jpeg.org/jpeg/software.html

[15] JPEG2000 official software OpenJPEG, https://jpeg.org/jpeg2000/software.html

[16] Z. Cheng, M. Takeuchi, J. Katto, “A Pre-Saliency Map Based Blind Image Quality Assessment via Convolutional Neural Networks”, IEEE Intl. Symposium on Multimedia, pp. 1-6, Dec. 11-13, 2017.
