
BVI-DVC: A Training Database for Deep Video Compression

Di Ma, Fan Zhang, and David R. Bull⋆

Department of Electrical and Electronic Engineering, University of Bristol, BS8 1UB, UK

{di.ma, fan.zhang, dave.bull}@bristol.ac.uk

Abstract. Deep learning methods are increasingly being applied in the optimisation of video compression algorithms and can achieve significantly enhanced coding gains, compared to conventional approaches. Such approaches often employ Convolutional Neural Networks (CNNs) which are trained on databases with relatively limited content coverage. In this paper, a new extensive and representative video database, BVI-DVC¹, is presented for training CNN-based coding tools. BVI-DVC contains 800 sequences at various spatial resolutions from 270p to 2160p and has been evaluated on ten existing network architectures for four different coding tools. Experimental results show that the database produces significant improvements in terms of coding gains over three existing (commonly used) image/video training databases, for all tested CNN architectures under the same training and evaluation configurations.

Keywords: Video database, CNN training, video compression

1 Introduction

From the introduction of the first international standard in 1984, video compression has played an essential role in the application and uptake of video technologies across film, television, terrestrial and satellite transmission, surveillance and particularly Internet video [1]. Inspired by recent breakthroughs in AI technology, deep learning methods such as Convolutional Neural Networks (CNNs) have been increasingly applied to video coding algorithms [2], providing significant coding gains compared to conventional approaches based on classic signal/image processing theory.

It is noted that these learning-based compression approaches demand volumes of training material much greater than typically used for conventional compression or existing machine learning methods. These should include diverse content covering different formats and video texture types.

⋆ The authors acknowledge funding from EPSRC (EP/L016656/1 and EP/M000885/1) and the NVIDIA GPU Seeding Grants.

¹ The BVI-DVC database can be downloaded from https://vilab.blogs.bristol.ac.uk/?p=2375.



Fig. 1: Average coding gains for four coding modules obtained using 10 commonly used network architectures trained on four different databases, including BVI-DVC, DIV2K [3], REDS [4] and CD [5], when they are integrated into the HEVC Test Model (HM 16.20). Here, the Bjøntegaard Delta measurement (BD-rate) [6] is used to evaluate the coding gains based on two quality metrics, PSNR and VMAF [7], benchmarked against the original HEVC HM 16.20; negative BD-rate values indicate bit rate savings or coding gains.

As far as we are aware, no such public data sets for this purpose currently exist, and most learning-based coding methods [8] have to date been trained on image or video databases which were mainly designed for computer vision applications, e.g. super-resolution. Most of these databases do not provide sufficient content coverage and diversity. As a result, the generalisation of networks cannot be ensured in the context of video coding, and the optimum performance of the employed CNN architectures is not achieved when they are trained on these databases.

In this paper, a new video database, referred to as BVI-DVC, is proposed for training CNN-based video coding algorithms. It contains 800 progressive-scanned video clips at a wide range of spatial resolutions from 270p to 2160p, with diverse and representative content. To demonstrate its training effectiveness, compared to three commonly used training databases [3-5], BVI-DVC has been utilised to train ten popular CNN architectures [9-18] for four video coding tools: post-processing (PP), in-loop filtering (ILF), spatial resolution adaptation (SRA) and effective bit depth adaptation (EBDA). The resulting CNN-based coding tools were then integrated into the Test Model (HM 16.20) of the current High Efficiency Video Coding (HEVC) standard [19], and evaluated based on the Joint Video Experts Team (JVET) Common Test Conditions (CTC) [20]. The results, as summarised in Fig. 1, demonstrate the significant coding gain benefits that are possible with BVI-DVC.

The primary contributions of this paper are:

1. A publicly available video database (BVI-DVC) specifically developed for training deep video compression algorithms.

2. An analysis of the performance of BVI-DVC in the context of four compression tools, compared with three existing (commonly used) databases.

3. A performance comparison of ten popular CNN architectures based on identical training materials and evaluation configurations.


The remainder of the paper is structured as follows. Section 2 summarises related work in deep learning-based video compression and commonly used image/video training databases. Section 3 introduces the proposed (BVI-DVC) video training database, while Section 4 presents the experimental configurations employed for evaluating the effectiveness of the proposed database. Results and discussions are provided in Section 5, and Section 6 concludes the paper and presents some suggestions for future work.

2 Related Work

With increasing spatial resolutions, higher frame rates, greater dynamic range and the requirement for multiple viewpoints, accompanied by dramatic increases in user numbers, video content has become the primary driver for increased internet bandwidth. How we represent and communicate video (via compression) is key in ensuring that the content is delivered at an appropriate quality, while maintaining compatibility with the transmission bandwidth.

To address this issue, ISO (MPEG) and ITU initiated the development of a new coding standard, Versatile Video Coding (VVC) [21], in 2018, targeting increased coding gain (by 30-50%) compared to the current HEVC standard. Concurrently, the Alliance for Open Media (AOM, formed in 2015) finalised an open-source, royalty-free media delivery solution (AV1) [22] in 2018, offering performance competitive with HEVC [23].

Machine learning based compression based on deep neural networks has become a popular research topic, offering evident enhancements over conventional coding tools for: intra prediction [24,25], motion estimation [26,27], transforms [28,29], quantisation [30], entropy coding [31,32], post-processing [33,34] and loop filtering [35,36]. New coding tools such as format adaptation [37,38] and virtual reference frame optimisation [27] have also been reported. Other work has implemented a complete coding framework based on neural networks using end-to-end training and optimisation [39-41]. It is noted that, despite their performance benefits, few of these approaches are currently being adopted by the latest VVC video coding standard. This is due to the high computational complexity and the large GPU memory requirements associated with CNN computation.

Training Databases are a critical component for optimising the performance of machine learning based algorithms. A well designed training database can ensure good model generalisation and avoid potential over-fitting problems [42,43]. As far as we are aware, there is no publicly available database which is specifically designed for learning-based video coding. Researchers, to date, have typically employed training databases developed for other purposes (such as super-resolution, frame interpolation and classification). Notable publicly available image and video training databases are summarised below.

– ImageNet [44] is a large image database primarily designed for visual object recognition. It contains more than 14 million RGB images at various spatial resolutions (up to 2848p) covering a wide range of natural content. It has also been used as a training database for single image super-resolution [14].

– DIV2K [3] contains 1000 RGB source images with a variety of content types, and was originally developed for super-resolution. It has been employed as training material by several JVET proposals [45,46] and many other CNN-based coding algorithms [47,48].

– BSDS [49] is an image database originally developed for image segmentation. It contains 500 RGB images, and has been used to train CNN-based loop filters [36] for video coding. Compared to DIV2K, BSDS has fewer source images and a lower spatial resolution (481×321).

– CD (Combined Database) [5] is a video database combining source content from the LIVE Video Quality Assessment Database [50], the MCL-V Database [51] and the TUM 1080p Database [52], and has been employed to train CNN-based super-resolution approaches [5]. It contains 29 sequences at two different spatial resolutions, 1920×1080 and 768×432.

– REDS [4] is a video database developed for training video super-resolution algorithms [53], which contains 300 video clips with a spatial resolution of 1280×720.

– UCF101 [54] is a large video training database initially designed for human action recognition, and has been frequently used for training CNN-based temporal frame interpolation and motion prediction approaches [55-57]. It contains 13320 videos collected from YouTube, covering 101 types of human action. All the sequences in UCF101 have a relatively low spatial resolution of 320×240.

Modern video coding algorithms are required to process content with diverse texture types at high spatial resolutions and bit depths. For example, the standard test sequences included in the JVET Common Test Conditions (CTC) data set include video clips at UHD resolution (2160p) with a bit depth of 10 bits, containing various static and dynamic textures². However, none of the training databases mentioned above contains image or video content with both high spatial resolution and high bit depth, and most do not include any dynamic texture content.

3 The BVI-DVC Video Training Database

Two hundred source sequences were carefully selected from public video databases, including 69 sequences from the Videvo Free Stock Video Footage set [59], 37 from the IRIS32 Free 4K Footage set [60], 25 from the Harmonics database [61], 19 from BVI-Texture [62], 10 from the MCML 4K video quality database [63], 7 from BVI-HFR [64], 7 from the SJTU 4K video database [65], 6 from LIVE-Netflix [66,67], 6 from the Mitch Martinez Free 4K Stock Footage set [68], 5 from the Dareful Free 4K Stock Video data set [69], 3 from MCL-V [51], 2 from MCL-JCV [70], 2 from Netflix Chimera [71], 1 from the TUM HD databases [72], and 1 from the Ultra Video Group-Tampere University database [73].

² Here we follow the definition of textures in [58]. Static textures are associated with rigid patterns undergoing simple movement or subject to camera movement, while dynamic textures have complex and irregular movements, e.g. water, fire or steam.


Table 1: Key features of seven databases including BVI-DVC.

Features             ImageNet [44]   DIV2K [3]   BSDS [49]   CD [5]   REDS [4]   UCF101 [54]   BVI-DVC
Image or Video?      Image           Image       Image       Video    Video      Video         Video
Seq Number           14M             1000        500         29       300        13320         800
Max Resolution       2848p           1152p       321p        1080p    720p       240p          2160p
Bit depth            8               8           8           8        8          8             10
Various textures?    No              No          No          No       No         No            Yes

[Sample sequences shown: (a) Animal, (b) Wood, (c) Leaves, (d) Mountain, (e) Myanmar, (f) Venice, (g) Tall Buildings, (h) Traffics, (i) Market, (j) Ferris Wheel, (k) Room, (l) Store, (m) Bookcase, (n) Toy, (o) Scarf, (p) Cross Walk, (q) Plasma, (r) Firewood, (s) Smoke, (t) Water]

Fig. 2: Sample frames of 20 example sequences from the BVI-DVC database.

These sequences contain natural scenes and objects [4], e.g. mountains, oceans, animals, grass, trees, countryside, city streets, towns, buildings, institutes, facilities, parks, marketplaces, historical places, vehicles and colourful textured fabrics. Different texture types, such as static texture, dynamic texture², structured content and luminance-plain content, are also included.

All these sequences are progressive-scanned at a spatial resolution of 3840×2160, with frame rates ranging from 24 fps to 120 fps, a bit depth of 10 bits, and in YCbCr 4:2:0 format. All are truncated to 64 frames without scene cuts, using the segmentation method described in [74]. To further increase data diversity and provide data augmentation, the 200 video clips were spatially down-sampled to 1920×1080, 960×540 and 480×270 using a Lanczos filter of order 3. This results in 800 sequences at four different resolutions. Fig. 2 shows sample frames of twenty example sequences. The primary features of this database are summarised in Table 1 alongside those of the other six databases [3-5,44,49,54] mentioned above.
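
As an illustration only, the plane-wise Lanczos-3 down-sampling described above can be reproduced with standard tooling; the sketch below uses Pillow, whose LANCZOS resampling corresponds to a Lanczos kernel with a = 3. The function names, the assumption of 10-bit planes held as NumPy arrays, and the use of Pillow rather than the authors' actual pipeline are all assumptions made for illustration.

```python
import numpy as np
from PIL import Image

def lanczos3_downsample(plane: np.ndarray, factor: int) -> np.ndarray:
    """Down-sample one 10-bit picture plane with a Lanczos-3 kernel (Pillow LANCZOS)."""
    h, w = plane.shape
    img = Image.fromarray(plane.astype(np.float32), mode='F')
    out = img.resize((w // factor, h // factor), resample=Image.LANCZOS)
    return np.clip(np.asarray(out), 0, 1023).round().astype(np.uint16)

def downsample_yuv420_frame(y, cb, cr, factor=2):
    """Apply the same filter to the luma plane and both chroma planes."""
    return (lanczos3_downsample(y, factor),
            lanczos3_downsample(cb, factor),
            lanczos3_downsample(cr, factor))

# Example: a 2160p frame would be reduced by factors of 2, 4 and 8 to obtain
# the 1080p, 540p and 270p variants; y, cb, cr are read from the 10-bit
# YUV 4:2:0 source file (reading code omitted).
```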


[(a) SI vs. TI        (b) SI vs. CF]

Fig. 3: Scatter plots of three video features for the 200 UHD source sequences in the BVI-DVC database.

Three low-level video features of all 200 UHD source sequences, spatial information (SI), temporal information (TI) and colourfulness (CF), have also been calculated and are plotted in Fig. 3. The definitions of these features can be found in [64,75]. It can be noted that the BVI-DVC database has a relatively wide coverage for these three video features, which indicates the diversity of the proposed database.
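
For reference, SI and TI are conventionally computed as the maximum, over frames, of the standard deviation of the Sobel-filtered luma and of the frame difference respectively (ITU-T P.910 style), and CF follows the Hasler and Süsstrunk colourfulness measure. A minimal sketch under those standard definitions is given below; the exact implementations used in [64,75] may differ in detail, and the function names are illustrative.

```python
import numpy as np
from scipy import ndimage

def si_ti(luma_frames):
    """Spatial and temporal information over a sequence of luma frames (float arrays)."""
    si, ti, prev = 0.0, 0.0, None
    for f in luma_frames:
        gx = ndimage.sobel(f, axis=1)          # horizontal gradient
        gy = ndimage.sobel(f, axis=0)          # vertical gradient
        si = max(si, float(np.std(np.hypot(gx, gy))))
        if prev is not None:
            ti = max(ti, float(np.std(f - prev)))
        prev = f
    return si, ti

def colourfulness(rgb_frame):
    """Hasler-Suesstrunk colourfulness metric for one RGB frame."""
    r, g, b = [rgb_frame[..., i].astype(np.float64) for i in range(3)]
    rg, yb = r - g, 0.5 * (r + g) - b
    return float(np.hypot(np.std(rg), np.std(yb)) +
                 0.3 * np.hypot(np.mean(rg), np.mean(yb)))
```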

4 Experiments

In order to evaluate the training effectiveness of the proposed BVI-DVC database in the context of video compression, ten network architectures [9-18] were employed for four CNN-based coding modules: post processing (PP), in-loop filtering (ILF), spatial resolution adaptation (SRA) and effective bit depth adaptation (EBDA).

In terms of benchmarking databases, given the limited time and resources available, we have chosen one image database (DIV2K) and two video databases (REDS and CD) to compare with BVI-DVC. DIV2K was selected because it has been used for training CNN models in multiple JVET contributions and many other CNN-based coding algorithms. Compared to other databases such as BSDS and UCF101, REDS and CD contain relatively higher resolution content and have been employed for training successful CNN-based super-resolution approaches.

4.1 Coding Modules

Coding Module 1 (Post Processing - PP): The coding workflow for post processing (PP) is illustrated in Fig. 4. PP is commonly applied at the decoder, on the reconstructed video frames, to reduce compression artefacts and enhance video quality. When a CNN-based approach is employed, the network takes each decoded frame as input and outputs the final reconstructed frame in the same format. Notable examples of CNN-based post processing for video coding can be found in [33,34].




Fig. 4: Coding workflow with a CNN-based post processing module.

Coding Module 2 (In-loop Filtering - ILF): In-loop filtering applies processing at both the encoder and the decoder on the reconstructed frames, and the output can be used as a reference for further encoding/decoding. An encoder architecture with a CNN-based ILF module is shown in Fig. 5. The input and the output of the CNN-based ILF are the same as those for PP [35,36].


Fig. 5: Coding workflow with a CNN-based in-loop filtering module (the modules in the dashed box form the corresponding decoder). Here, DBF stands for de-blocking filtering, and SAO for sample adaptive offset.


Coding Module 3 (Spatial Resolution Adaptation - SRA): CNN-based spatial resolution adaptation (SRA) down-samples the spatial resolution of the original video frames before encoding, and reconstructs the full resolution during decoding through CNN-based super-resolution. This approach can be applied at Coding Tree Unit level [37] or to the whole frame. Here we only implemented frame-level SRA [76], as shown in Fig. 6. In this case, the original video frames are spatially down-sampled by a fixed factor of 2, using the Lanczos3 filter. The CNN-based super-resolution module processes the compressed and down-sampled video frames at the decoder to generate full resolution reconstructed frames. It is noted that a nearest neighbour filter is first applied to the reconstructed down-sampled video frame before the CNN operation [38].
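
The nearest-neighbour pre-up-sampling step simply brings the decoded low-resolution frame back to full resolution (by pixel repetition) so that the CNN operates at the target size; a minimal sketch for one plane and a factor of 2 is shown below, with the function name chosen here for illustration.

```python
import numpy as np

def nearest_neighbour_upsample(plane: np.ndarray, factor: int = 2) -> np.ndarray:
    """Repeat each pixel factor x factor times (nearest-neighbour interpolation),
    producing the full-resolution input expected by the super-resolution CNN."""
    return np.repeat(np.repeat(plane, factor, axis=0), factor, axis=1)
```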


Fig. 6: Coding workflow with a CNN-based spatial resolution adaptation module.

Coding Module 4 (Effective Bit Depth Adaptation - EBDA): Similar to the case for spatial resolution, bit depth can also be adapted during encoding in order to achieve improved coding efficiency. Here Effective Bit Depth (EBD) is defined as the actual bit depth used to represent the video content, which may be different from the Coding Bit Depth (CBD) that defines the pixel bit depth, e.g. InternalBitDepth in HEVC reference encoders. This process is illustrated in Fig. 7. In this paper, we have fixed the CBD at 10 bits, the same as in the Main10 profile of HEVC, and only down-sampled the original frames by 1 bit through bit-shifting. At the decoder, the bit depth of the decoded frames is up-sampled to 10 bits using a CNN-based approach. More information about EBDA can be found in [77].

Fig. 7: Coding workflow with a CNN-based effective bit depth adaptation module.
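
The bit-shifting operations shown in Fig. 7 reduce to simple integer shifts; a minimal sketch is given below, assuming planes stored as unsigned 16-bit NumPy arrays. The naive up-shift is included only as the baseline that the CNN-based up-sampling is intended to improve upon; the function names are illustrative.

```python
import numpy as np

def ebd_downsample(plane10: np.ndarray, shift: int = 1) -> np.ndarray:
    """Reduce effective bit depth by 'shift' bits (10-bit -> 9-bit for shift=1)."""
    return plane10.astype(np.uint16) >> shift

def ebd_upsample_naive(plane9: np.ndarray, shift: int = 1) -> np.ndarray:
    """Naive linear up-shift back to 10 bits; the CNN-based bit depth up-sampling
    replaces this simple reconstruction with a learned mapping."""
    return np.clip(plane9.astype(np.uint16) << shift, 0, 1023)
```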

4.2 Employed CNN Models

For these four coding modules, ten popular network architectures have been implemented for evaluation. These include five with residual blocks, two with residual dense blocks, and three without any residual block structure. Most of these network structures were initially designed for super-resolution processing or image enhancement, and some have been employed in CNN-based coding approaches as described in Section 2. Their primary features are briefly described below:



– SRCNN [9] is the first CNN model designed for single image super resolution (SISR). It employs a simple network structure with only 3 convolutional layers.

– FSRCNN [10] was also developed for SISR, containing 8 convolutional layers with various kernel sizes.

– VDSR [11] contains 20 convolutional layers employing global residual learning to achieve enhanced performance. However it does not employ residual blocks [78], which may lead to unstable training and evaluation performance [12].

– SRResNet [14] was the first network structure with residual blocks designed for SISR, improving the overall performance and stability of the network.

– DRRN [12] employs a recursive structure and also contains residual blocks.
– EDSR [13] significantly increases the number of feature maps (256) for the convolutional layers in each residual block. It offers improved overall performance but with much higher computational complexity.

– RDN [17] was the first network architecture to combine residual blocks and dense connections [79] for SISR.

– ESRResNet [15] enhances SRResNet by combining residual blocks with dense connections, and employs residual learning at multiple levels. It also removes the batch normalisation (BN) layer used in SRResNet to further stabilise training and reduce artefacts.

– RCAN [16] incorporates a channel attention (CA) scheme in the CNN, which better recovers high frequency texture details.

– MSRResNet [18] modifies SRResNet by removing the BN layers from all the residual blocks. This network structure has been employed in several CNN-based coding algorithms and has been reported to offer significant coding gains [38]. A minimal sketch of this type of BN-free residual block is given after this list.
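
The better-performing architectures above (EDSR, MSRResNet, ESRResNet) share a residual block in which batch normalisation is removed. The following tf.keras sketch illustrates such a block and a small stack of them with a global skip connection; the layer widths, depth and residual scaling are illustrative choices, not the exact configurations of the cited papers.

```python
import tensorflow as tf

def residual_block(x, filters=64, kernel_size=3, scale=1.0):
    """Residual block without batch normalisation (EDSR/MSRResNet style)."""
    y = tf.keras.layers.Conv2D(filters, kernel_size, padding='same')(x)
    y = tf.keras.layers.Activation('relu')(y)
    y = tf.keras.layers.Conv2D(filters, kernel_size, padding='same')(y)
    if scale != 1.0:
        y = tf.keras.layers.Lambda(lambda t: t * scale)(y)
    return tf.keras.layers.Add()([x, y])

# Illustrative example: 16 such blocks with a global (long) skip connection,
# mapping a 96x96x3 input block to a 96x96x3 output block.
inputs = tf.keras.Input(shape=(96, 96, 3))
feat = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
x = feat
for _ in range(16):
    x = residual_block(x)
x = tf.keras.layers.Add()([x, feat])
outputs = tf.keras.layers.Conv2D(3, 3, padding='same')(x)
model = tf.keras.Model(inputs, outputs)
```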


In this experiment, we have employed identical architectures for these ten networks, as reported in their original publications, and only modified the input and output interfaces in order to process content in the appropriate format. The input of all the employed CNNs is a 96×96 YCbCr 4:4:4 colour image block, while the output targets the corresponding original image block of the same size. The input block can be either compressed (for PP and ILF), compressed and EBD down-sampled (for EBDA), or compressed and spatial resolution re-sampled (for SRA, where a nearest neighbour filter is applied before CNN processing). The same loss functions have been used as in the corresponding literature. All ten networks have been re-implemented using the TensorFlow framework (version 1.8.0).

4.3 Training Data

Three existing image and video databases were selected to benchmark the training effectiveness of BVI-DVC: DIV2K [3], REDS [4] and CD [5]. All the original images or videos in each database were first spatially down-sampled by a factor of 2 using a Lanczos3 filter, or down-sampled by 1 bit through bit-shifting. The original content (for training PP and ILF CNNs), together with the spatially down-sampled clips (for training SRA CNNs) and the bit depth reduced sequences (for training EBDA CNNs), was then compressed by the HEVC Test Model (HM 16.20) based on the JVET Common Test Conditions (CTC) [20] using the Random Access configuration (Main10 profile) with four base QP (quantisation parameter) values: 22, 27, 32 and 37³ (a fixed QP offset of -6 is applied for both spatially and bit depth down-sampled cases, as in [80]). This results in three training input content groups for every database, each of which contains four QP sub-groups. For the input content group with reduced spatial resolution, a nearest neighbour filter was applied to obtain video frames with the same size as the original content.

For each input group and QP sub-group, video frames of all reconstructed sequences and their original counterparts were randomly selected (with the same spatial and temporal sampling rates) and split into 96×96 image blocks, which were then converted to YCbCr 4:4:4 format. Block rotation was also applied here for data augmentation.
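
A sketch of this patch preparation step is shown below, assuming the decoded and original frames have already been converted to co-located YCbCr 4:4:4 arrays. The block size follows the text; the non-overlapping grid, the choice of random 90-degree rotations as the "block rotation" augmentation, and the helper names are assumptions made for illustration.

```python
import numpy as np

def extract_training_pairs(decoded, original, block=96, stride=96, rng=None):
    """Cut co-located 96x96 blocks from an (H, W, 3) decoded/original frame pair
    and apply a random 90-degree rotation to each pair for augmentation."""
    rng = rng or np.random.default_rng()
    pairs = []
    h, w, _ = decoded.shape
    for top in range(0, h - block + 1, stride):
        for left in range(0, w - block + 1, stride):
            d = decoded[top:top + block, left:left + block]
            o = original[top:top + block, left:left + block]
            k = int(rng.integers(0, 4))          # 0, 90, 180 or 270 degrees
            pairs.append((np.rot90(d, k), np.rot90(o, k)))
    return pairs
```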

4.4 Network Training and Evaluation

The training process was conducted using the following parameters: Adam optimisation [81] with hyper-parameters β1=0.9 and β2=0.999; a batch size of 16; 200 training epochs; a learning rate of 0.0001; and a weight decay of 0.1 applied every 100 epochs. This generates 480 CNN models in total: 4 training databases × 3 input content groups (PP and ILF use the same CNN models) × 4 QP sub-groups × 10 tested network architectures.

³ Here results are generated for four QP values only, due to the limited time and resources available. During evaluation, if the base QP differs from these four, the CNN model for the closest QP value is used.
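
For illustration, the training configuration above maps onto a standard setup such as the tf.keras sketch below. The original experiments used TensorFlow 1.8, so this is only an approximation; in particular, interpreting the "weight decay of 0.1 for every 100 epochs" as a step decay of the learning rate by a factor of 0.1, the L1 loss, and the small placeholder model and data are assumptions.

```python
import numpy as np
import tensorflow as tf

def lr_schedule(epoch, lr):
    # Step decay: scale the learning rate by 0.1 every 100 epochs
    # (one reading of the "weight decay of 0.1 for every 100 epochs" setting).
    return lr * 0.1 if epoch > 0 and epoch % 100 == 0 else lr

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)

# Placeholder model standing in for any of the ten architectures
# (96x96x3 input block -> 96x96x3 output block).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu',
                           input_shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(3, 3, padding='same')])
model.compile(optimizer=optimizer, loss='mae')  # actual losses follow the cited papers

# Placeholder arrays standing in for the 96x96 training pairs described above.
x = np.zeros((64, 96, 96, 3), dtype=np.float32)
y = np.zeros((64, 96, 96, 3), dtype=np.float32)
model.fit(x, y, batch_size=16, epochs=200,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```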


Table 2: Evaluation results for the PP coding module for ten tested network architectures and four different training databases. Values indicate the average BD-rate (%) over all nineteen JVET CTC test sequences, assessed by PSNR or VMAF.

CNN Model (PP)      |  DIV2K [3]      |  REDS [4]       |  CD [5]         |  BVI-DVC
                    |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF
SRCNN [9]           |   0.4    -2.8   |   2.5    -3.8   |  11.0    -6.2   |  -1.9    -7.4
FSRCNN [10]         |   0.7    -2.6   |   3.2    -1.2   |  24.4     2.6   |  -1.6    -7.3
VDSR [11]           |   0.3    -2.9   |   2.3    -4.0   |   3.1    -2.2   |  -1.9    -7.6
DRRN [12]           |  -5.0    -6.1   |  -4.7    -8.2   |   0.4    -1.1   | -10.8   -14.9
EDSR [13]           |  -5.4    -4.9   |  -3.1    -6.1   |  -0.8    -6.5   | -10.0   -14.6
SRResNet [14]       |  -5.3    -5.4   |  -4.0    -9.0   |   4.0    -3.8   |  -9.8   -12.7
ESRResNet [15]      |  -6.9    -6.7   |  -6.1    -9.4   |  -3.5    -9.1   | -11.8   -17.7
RCAN [16]           |  -6.6    -7.3   |  -6.3    -9.7   |  -4.5   -11.0   | -12.1   -18.5
RDN [17]            |  -7.0    -7.2   |  -6.9   -10.6   |  -4.6   -10.9   | -12.2   -17.0
MSRResNet [18]      |  -6.4    -6.5   |  -5.3    -9.2   |  -2.6    -8.7   | -10.4   -14.2


During the evaluation stage, for a specific coding module, the decoded video frames (already up-sampled to the same spatial resolution if needed) are first segmented into 96×96 overlapping blocks, with an overlap of 4 pixels, as CNN input (after YCbCr 4:4:4 conversion). The output blocks are then aggregated following the same pattern and converted back to YCbCr 4:2:0 format to form the final reconstructed frame.
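
A sketch of this overlapping-block inference is given below. Simple averaging of overlapping outputs is used as one plausible way of aggregating blocks "following the same pattern"; the exact merging rule, the handling of frame borders, and the function names are assumptions for illustration.

```python
import numpy as np

def blockwise_inference(frame, predict, block=96, overlap=4):
    """Run 'predict' on overlapping 96x96 blocks of an (H, W, 3) frame and
    average the overlapping outputs back into a full-size frame."""
    h, w, c = frame.shape
    out = np.zeros((h, w, c), dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    step = block - overlap
    tops = list(range(0, h - block + 1, step))
    lefts = list(range(0, w - block + 1, step))
    if tops[-1] != h - block:                     # make sure the frame edges are covered
        tops.append(h - block)
    if lefts[-1] != w - block:
        lefts.append(w - block)
    for top in tops:
        for left in lefts:
            patch = frame[top:top + block, left:left + block]
            out[top:top + block, left:left + block] += predict(patch)
            weight[top:top + block, left:left + block] += 1.0
    return out / weight
```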

5 Results and Discussions

Four different coding modules have been integrated into the HEVC (HM 16.20) reference software, and have been fully tested under the JVET CTC [20] using the Random Access configuration (Main10 profile). Nineteen JVET-CTC SDR (standard dynamic range) video sequences from resolution classes A1, A2, B, C and D were employed as test content, none of which were included in any of the three training databases. It is noted that only classes A1 and A2 (2160p) were used to evaluate the SRA coding module, as it has been previously reported [38] that for lower resolutions SRA may provide limited and inconsistent coding gains.

The rate quality (coding) performance is benchmarked against the original HEVC HM 16.20, using the Bjøntegaard Delta measurement (BD-rate) [6] based on two quality metrics: Peak Signal-to-Noise Ratio (PSNR, Y luminance channel only) and Video Multimethod Assessment Fusion (VMAF, version 0.6.1) [7]. PSNR is the most commonly used assessment method for video compression, while VMAF is a machine learning based quality metric used commercially in streaming. VMAF combines multiple quality metrics and video features using a Support Vector Machine (SVM) regressor, and has been reported to offer better correlation with subjective opinions [82]. BD-rate statistics indicate the overall bit rate saving achieved by the test algorithm across the tested QP range, at the same video quality, compared to the anchor. The average BD-rate values are reported in Tables 2-5 for each evaluated training database, network architecture and coding module.
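
For reference, the classic Bjøntegaard calculation fits a cubic polynomial of log bit rate against quality over the four RD points (here QP 22, 27, 32 and 37) and integrates the difference over the overlapping quality range; a minimal NumPy sketch of that classic variant is given below. The JVET CTC tooling uses a piecewise-cubic interpolation variant, so the numbers it produces can differ slightly from this sketch.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Bjøntegaard delta rate: average bit rate difference (%) at equal quality
    (quality can be PSNR or VMAF), classic cubic-polynomial variant."""
    lr_a = np.log(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log(np.asarray(rate_test, dtype=float))
    # Fit third-order polynomials of log-rate as a function of quality.
    p_a = np.polyfit(quality_anchor, lr_a, 3)
    p_t = np.polyfit(quality_test, lr_t, 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)        # mean log-rate difference
    return (np.exp(avg_diff) - 1) * 100           # negative values = bit rate saving
```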


Table 3: Evaluation results for the ILF coding module for ten tested network architectures and four different databases. Each value indicates the average BD-rate (%) over all nineteen JVET CTC test sequences, assessed by PSNR or VMAF.

CNN Model (ILF)     |  DIV2K [3]      |  REDS [4]       |  CD [5]         |  BVI-DVC
                    |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF
SRCNN [9]           |  -0.2    -2.6   |  -0.6    -2.5   |  -0.1    -1.4   |  -1.4    -8.5
FSRCNN [10]         |  -0.1    -2.2   |  -0.2    -1.1   |   0.0    -0.5   |  -1.3    -8.1
VDSR [11]           |  -1.0    -1.1   |  -0.7    -2.7   |   0.0    -0.5   |  -2.2    -6.5
DRRN [12]           |  -4.0    -5.6   |  -3.1    -4.6   |   0.4    -1.1   |  -6.8   -11.0
EDSR [13]           |  -4.5    -6.1   |  -2.7    -3.1   |  -1.3    -4.0   |  -5.9    -9.9
SRResNet [14]       |  -5.1    -8.6   |  -2.9    -4.0   |  -1.1    -2.8   |  -6.4   -10.6
ESRResNet [15]      |  -5.8    -8.6   |  -3.3    -5.1   |  -2.5    -6.8   |  -7.3   -12.0
RCAN [16]           |  -5.4    -8.5   |  -6.3    -9.7   |  -3.0    -8.5   |  -7.4   -11.4
RDN [17]            |  -5.8    -8.8   |  -3.7    -5.6   |  -3.0    -8.9   |  -7.5   -11.8
MSRResNet [18]      |  -5.6    -9.4   |  -4.6    -6.7   |  -2.0    -6.5   |  -6.4   -11.3


Fig. 8: Average BD-rate (based on PSNR and VMAF) of the four tested training databases for all the evaluated coding modules and CNN architectures.


5.1 Comparison of Databases

It can be observed from Tables 2-5 that, for all tested network architectures and coding modules, the coding gains (in terms of average BD-rates over all tested sequences) achieved after training on the proposed BVI-DVC database are significantly greater than those for the other three benchmark databases (DIV2K, REDS and CD), for both PSNR and VMAF quality metrics.


Table 4: Evaluation results for the SRA coding module for ten tested network architectures and four different databases. Each value indicates the average BD-rate (%) over all six UHD JVET CTC test sequences, assessed by PSNR or VMAF.

CNN Model (SRA)     |  DIV2K [3]      |  REDS [4]       |  CD [5]         |  BVI-DVC
                    |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF
SRCNN [9]           |   3.9   -11.9   |   8.6   -19.6   |   6.6   -12.4   |  -3.1   -21.1
FSRCNN [10]         |   1.1   -12.3   |  -0.3   -18.0   |   9.9    -9.8   |  -4.5   -20.9
VDSR [11]           |   4.4   -11.5   |   4.3   -15.9   |  25.9     7.2   |  -6.6   -18.3
DRRN [12]           |  -8.5   -17.6   |  -7.8   -26.1   |  -7.2   -22.1   | -15.0   -33.2
EDSR [13]           |  -6.4   -16.3   |  -6.9   -26.1   |  -3.2   -20.4   | -13.4   -30.1
SRResNet [14]       |  -6.7   -11.5   |  -7.0   -28.1   |  -5.5   -19.8   | -13.2   -30.0
ESRResNet [15]      |  -9.9   -19.4   |  -9.9   -31.7   |  -7.8   -23.5   | -16.1   -33.6
RCAN [16]           | -10.2   -19.3   | -10.9   -32.2   |  -8.4   -23.2   | -17.1   -35.1
RDN [17]            | -10.0   -19.1   |  -9.7   -31.4   |  -8.4   -22.7   | -16.6   -34.5
MSRResNet [18]      |  -9.2   -18.9   |  -8.5   -29.9   |  -7.1   -22.8   | -14.6   -32.7


Fig. 9: Average BD-rate (based on PSNR and VMAF) of the ten tested network architectures for all coding modules and training databases.

This is reinforced by considering the mean (over ten networks and four coding modules) of all the average BD-rates for each database; Fig. 8 shows in excess of 4.2% and 5.4% additional bit rate savings obtained by using BVI-DVC compared to the other three databases, based on the PSNR and VMAF quality metrics respectively. CD offers the worst overall performance, especially for results based on PSNR.

5.2 Comparison of Networks

We can also compare the ten evaluated network architectures under fair configurations (identical training and evaluation databases). The results in Tables 2-5 are summarised in Fig. 9, by taking the mean (over four training databases and four coding modules) of the average BD-rate values for each network architecture.


Table 5: Evaluation results for the EBDA coding module for ten tested network architectures and four different databases. Each value indicates the average BD-rate (%) over all nineteen JVET CTC test sequences, assessed by PSNR or VMAF.

CNN Model (EBDA)    |  DIV2K [3]      |  REDS [4]       |  CD [5]         |  BVI-DVC
                    |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF   |  PSNR    VMAF
SRCNN [9]           |  -0.1    -7.6   |   2.6    -7.0   |  11.9   -10.0   |  -2.1   -11.0
FSRCNN [10]         |   0.1    -6.7   |   3.7    -5.2   |  28.0    -0.4   |  -2.9   -11.5
VDSR [11]           |   0.93   -6.8   |  19.6    -4.5   |  23.0    -0.13  |  -5.6    -9.1
DRRN [12]           |  -6.0   -10.8   |  -3.7    -9.8   |  -1.1   -11.3   | -11.8   -18.4
EDSR [13]           |  -6.1   -11.5   |  -3.6   -11.1   |  -0.2    -9.6   | -10.3   -17.7
SRResNet [14]       |  -5.9   -10.4   |   0.5    -9.2   |   2.1    -7.0   | -10.5   -15.9
ESRResNet [15]      |  -7.1   -11.3   |  -4.1   -11.0   |  -2.0   -13.8   | -12.0   -19.0
RCAN [16]           |  -7.6   -11.0   |  -5.2   -11.7   |  -1.4   -12.3   | -12.5   -19.8
RDN [17]            |  -7.7   -11.7   |  -5.6   -10.6   |  -1.5   -13.7   | -12.1   -19.1
MSRResNet [18]      |  -7.0   -11.2   |  -4.1   -11.7   |  -2.5   -13.9   | -11.1   -17.9

It can be observed that RCAN, RDN, ESRResNet and MSRResNet offer better coding performance (for both PSNR and VMAF) than the other six evaluated network architectures. This is likely to be because of the residual block structure they employ. The coding gains for VDSR, FSRCNN and SRCNN are relatively low compared to the other networks, and they exhibit coding losses when PSNR is used to assess video quality, especially when trained on the CD database (refer to Fig. 1). This may be due to their simple network architectures (FSRCNN and SRCNN) and the large number of convolutional layers without a residual learning structure (VDSR), which lead to less stable training and evaluation [16,17].

6 Conclusion

This paper presents a new database (BVI-DVC) specifically designed for training CNN-based video compression algorithms. With carefully selected sequences containing diverse content, this database offers significantly improved training effectiveness compared to other commonly used image and video training databases. It has been evaluated for four different coding modules with ten typical CNN architectures. The BVI-DVC database is available online¹ for public testing. Its content diversity makes it a reliable training database, not just for CNN-based compression approaches, but also for other computer vision and image processing related tasks, such as image/video de-noising, video frame interpolation and super-resolution. Future work should focus on developing larger training databases with more immersive video formats, including higher dynamic range, higher frame rates and higher spatial resolutions.


References

1. Bull, D.R.: Communicating Pictures: A Course in Image and Video Coding. Academic Press (2014)
2. Ma, S., Zhang, X., Jia, C., Zhao, Z., Wang, S., Wang, S.: Image and video compression with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology (2019)
3. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (July 2017)
4. Nah, S., Baik, S., Hong, S., Moon, G., Son, S., Timofte, R., Mu Lee, K.: NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2019)
5. Liu, D., Wang, Z., Fan, Y., Liu, X., Wang, Z., Chang, S., Huang, T.: Robust video super-resolution with learned temporal dynamics. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 2507-2515
6. Bjøntegaard, G.: Calculation of average PSNR differences between RD-curves. In: 13th VCEG Meeting, no. VCEG-M33, Austin, Texas, USA: ITU-T (2001)
7. Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., Manohara, M.: Toward a practical perceptual video quality metric. The Netflix Tech Blog 6 (2016)
8. Liu, D., Li, Y., Lin, J., Li, H., Wu, F.: Deep learning-based video coding: A review and a case study. ACM Computing Surveys (CSUR) 53(1) (2020) 1-35
9. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2) (2015) 295-307
10. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European Conference on Computer Vision, Springer (2016) 391-407
11. Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 1646-1654
12. Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 3147-3155
13. Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017) 136-144
14. Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 4681-4690
15. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: ESRGAN: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
16. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018) 286-301
17. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 2472-2481


18. Ma, D., Afonso, M.F., Zhang, F., Bull, D.R.: Perceptually-inspired super-resolution of compressed videos. In: Applications of Digital Image Processing XLII. Volume 11137, International Society for Optics and Photonics (2019) 310-318
19. ITU-T Rec. H.265: High efficiency video coding. ITU-T Std. (2015)
20. Bossen, F., Boyce, J., Li, X., Seregin, V., Suhring, K.: JVET common test conditions and software reference configurations for SDR video. In: the JVET meeting. Number JVET-M1001, ITU-T, ISO/IEC (2019)
21. Bross, B., Chen, J., Liu, S., Wang, Y.K.: Versatile video coding (draft 7). In: JVET-P2001. ITU-T and ISO/IEC (2019)
22. AOMedia Video 1 (AV1). https://aomedia.googlesource.com/
23. Katsenou, A.V., Zhang, F., Afonso, M., Bull, D.R.: A subjective comparison of AV1 and HEVC for adaptive video streaming. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE (2019) 4145-4149
24. Yeh, C.H., Zhang, Z.T., Chen, M.J., Lin, C.Y.: HEVC intra frame coding based on convolutional neural network. IEEE Access 6 (2018) 50087-50095
25. Li, J., Li, B., Xu, J., Xiong, R., Gao, W.: Fully connected network-based intra prediction for image coding. IEEE Transactions on Image Processing 27(7) (2018) 3236-3247
26. Zhao, Z., Wang, S., Wang, S., Zhang, X., Ma, S., Yang, J.: Enhanced bi-prediction with convolutional neural network for high efficiency video coding. IEEE Transactions on Circuits and Systems for Video Technology (2018)
27. Zhao, L., Wang, S., Zhang, X., Wang, S., Ma, S., Gao, W.: Enhanced motion-compensated video coding with deep virtual reference frame generation. IEEE Transactions on Image Processing (2019)
28. Puri, S., Lasserre, S., Le Callet, P.: CNN-based transform index prediction in multiple transforms framework to assist entropy coding. In: 2017 25th European Signal Processing Conference (EUSIPCO), IEEE (2017) 798-802
29. Jimbo, S., Wang, J., Yashima, Y.: Deep learning-based transformation matrix estimation for bidirectional interframe prediction. In: 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE), IEEE (2018) 726-730
30. Alam, M.M., Nguyen, T.D., Hagan, M.T., Chandler, D.M.: A perceptual quantization strategy for HEVC based on a convolutional neural network trained on natural images. In: Applications of Digital Image Processing XXXVIII. Volume 9599, International Society for Optics and Photonics (2015) 959918
31. Song, R., Liu, D., Li, H., Wu, F.: Neural network-based arithmetic coding of intra prediction modes in HEVC. In: 2017 IEEE Visual Communications and Image Processing (VCIP), IEEE (2017) 1-4
32. Ma, C., Liu, D., Peng, X., Wu, F.: Convolutional neural network-based arithmetic coding of DC coefficients for HEVC intra coding. In: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE (2018) 1772-1776
33. Li, C., Song, L., Xie, R., Zhang, W.: CNN based post-processing to improve HEVC. In: 2017 IEEE International Conference on Image Processing (ICIP), IEEE (2017) 4577-4580
34. Zhao, H., He, M., Teng, G., Shang, X., Wang, G., Feng, Y.: A CNN-based post-processing algorithm for video coding efficiency improvement. IEEE Access (2019)
35. Li, T., Xu, M., Zhu, C., Yang, R., Wang, Z., Guan, Z.: A deep learning approach for multi-frame in-loop filter of HEVC. IEEE Transactions on Image Processing 28(11) (2019) 5663-5678
36. Jia, C., Wang, S., Zhang, X., Wang, S., Liu, J., Pu, S., Ma, S.: Content-aware convolutional neural network for in-loop filtering in high efficiency video coding. IEEE Transactions on Image Processing 28(7) (2019) 3343-3356


37. Lin, J., Liu, D., Yang, H., Li, H., Wu, F.: Convolutional neural network-based block up-sampling for HEVC. IEEE Transactions on Circuits and Systems for Video Technology (2018)
38. Zhang, F., Afonso, M., Bull, D.R.: ViSTRA2: Video coding using spatial resolution and effective bit depth adaptation. arXiv preprint arXiv:1911.02833 (2019)
39. Balle, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. arXiv preprint arXiv:1611.01704 (2016)
40. Minnen, D., Balle, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. In: Advances in Neural Information Processing Systems (2018) 10771-10780
41. Rippel, O., Nair, S., Lew, C., Branson, S., Anderson, A.G., Bourdev, L.: Learned video compression. In: Proceedings of the IEEE International Conference on Computer Vision (2019) 3454-3463
42. Zhu, X., Vondrick, C., Fowlkes, C.C., Ramanan, D.: Do we need more training data? International Journal of Computer Vision 119(1) (Aug 2016) 76-92
43. Tian, Y., Zhang, Y., Fu, Y., Xu, C.: TDAN: Temporally deformable alignment network for video super-resolution. arXiv preprint arXiv:1812.02898 (2018)
44. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3) (2015) 211-252
45. Wan, S., Wang, M.Z., Gong, H., Zou, C.Y., Ma, Y.Z., Huo, J.Y., et al.: CE10: Integrated in-loop filter based on CNN (Tests 2.1, 2.2 and 2.3). In: the JVET meeting. Number JVET-O0079, Gothenburg, Sweden: ITU-T, ISO/IEC (2019)
46. Hashimoto, T., Sasaki, E., Ikai, T.: AHG9: Separable convolutional neural network filter with squeeze-and-excitation block. In: the JVET meeting. Number JVET-K0158, Ljubljana, Slovenia: ITU-T, ISO/IEC (2018)
47. Wang, Y., Zhu, H., Li, Y., Chen, Z., Liu, S.: Dense residual convolutional neural network based in-loop filter for HEVC. In: 2018 IEEE Visual Communications and Image Processing (VCIP) (Dec 2018) 1-4
48. Wang, D., Xia, S., Yang, W., Hu, Y., Liu, J.: Partition tree guided progressive rethinking network for in-loop filtering of HEVC. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE (2019) 2671-2675
49. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001). Volume 2 (July 2001) 416-423
50. Seshadrinathan, K., Soundararajan, R., Bovik, A.C., Cormack, L.K.: Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing 19(6) (2010) 1427-1441
51. Lin, J.Y., Song, R., Wu, C.H., Liu, T., Wang, H., Kuo, C.C.J.: MCL-V: A streaming video quality assessment database. Journal of Visual Communication and Image Representation 30 (2015) 1-9
52. Keimel, C., Habigt, J., Habigt, T., Rothbucher, M., Diepold, K.: Visual quality of current coding technologies at high definition IPTV bitrates. In: 2010 IEEE International Workshop on Multimedia Signal Processing, IEEE (2010) 390-393
53. Nah, S., Timofte, R., Baik, S., Hong, S., Moon, G., Son, S., Mu Lee, K.: NTIRE 2019 challenge on video deblurring: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2019)
54. Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)


55. Liu, Z., Yeh, R.A., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 4463-4471
56. Szeto, R., Sun, X., Lu, K., Corso, J.J.: A temporally-aware interpolation network for video frame inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
57. Zhang, Z., Chen, L., Xie, R., Song, L.: Frame interpolation via refined deep voxel flow. In: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE (2018) 1473-1477
58. Zhang, F., Bull, D.R.: A parametric framework for video compression using region-based texture models. IEEE Journal of Selected Topics in Signal Processing 5(7) (2011) 1378-1392
59. Videvo Free Stock Video Footage. https://www.videvo.net/
60. IRIS32FREE 4K UHD FREE FOOTAGE. https://www.youtube.com/channel/UCjJee-JAzoRRH5T0wqe7_tw/
61. Harmonic Inc 4K demo footage. http://www.harmonicinc.com/4k-demo-footage-download/ (Accessed: 1st May 2017)
62. Papadopoulos, M.A., Zhang, F., Agrafiotis, D., Bull, D.: A video texture database for perceptual compression and quality assessment. In: 2015 IEEE International Conference on Image Processing (ICIP), IEEE (2015) 2781-2785
63. Cheon, M., Lee, J.S.: Subjective and objective quality assessment of compressed 4K UHD videos for immersive experience. IEEE Transactions on Circuits and Systems for Video Technology 28(7) (2017) 1467-1480
64. Mackin, A., Zhang, F., Bull, D.R.: A study of high frame rate video formats. IEEE Transactions on Multimedia 21(6) (2018) 1499-1512
65. Song, L., Tang, X., Zhang, W., Yang, X., Xia, P.: The SJTU 4K video sequence dataset. In: 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), IEEE (2013) 34-35
66. Bampis, C.G., Li, Z., Moorthy, A.K., Katsavounidis, I., Aaron, A., Bovik, A.C.: Study of temporal effects on subjective video quality of experience. IEEE Transactions on Image Processing 26(11) (2017) 5217-5231
67. Bampis, C.G., Li, Z., Katsavounidis, I., Huang, T.Y., Ekanadham, C., Bovik, A.C.: Towards perceptually optimized end-to-end adaptive video streaming. arXiv preprint arXiv:1808.03898 (2018)
68. Mitch Martinez FREE 4K STOCK FOOTAGE. http://mitchmartinez.com/free-4k-red-epic-stock-footage/
69. Dareful - Completely Free 4K Stock Video. https://www.dareful.com/
70. Wang, H., Gan, W., Hu, S., Lin, J.Y., Jin, L., Song, L., Wang, P., Katsavounidis, I., Aaron, A., Kuo, C.C.J.: MCL-JCV: a JND-based H.264/AVC video quality assessment dataset. In: 2016 IEEE International Conference on Image Processing (ICIP), IEEE (2016) 1509-1513
71. Katsavounidis, I.: Chimera video sequence details and scenes. Tech. rep., Netflix (November 2015)
72. Keimel, C., Redl, A., Diepold, K.: The TUM high definition video datasets. In: 2012 Fourth International Workshop on Quality of Multimedia Experience, IEEE (2012) 97-102
73. Ultra Video Group. http://ultravideo.cs.tut.fi/#testsequences/
74. Moss, F.M., Wang, K., Zhang, F., Baddeley, R., Bull, D.R.: On the optimal presentation duration for subjective video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology 26(11) (2015) 1977-1987


75. Winkler, S.: Analysis of public image and video databases for quality assessment. IEEE Journal of Selected Topics in Signal Processing 6(6) (2012) 616-625
76. Afonso, M., Zhang, F., Bull, D.R.: Video compression based on spatio-temporal resolution adaptation. IEEE Transactions on Circuits and Systems for Video Technology 29(1) (2019) 275-280
77. Zhang, F., Afonso, M., Bull, D.R.: Enhanced video compression based on effective bit depth adaptation. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE (2019)
78. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 770-778
79. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 4700-4708
80. Afonso, M., Zhang, F., Katsenou, A., Agrafiotis, D., Bull, D.: Low complexity video coding based on spatial resolution adaptation. In: 2017 IEEE International Conference on Image Processing (ICIP), IEEE (2017) 3011-3015
81. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
82. Zhang, F., Moss, F.M., Baddeley, R., Bull, D.R.: BVI-HD: A video quality database for HEVC compressed and texture synthesized content. IEEE Transactions on Multimedia 20(10) (2018) 2620-2630