Reconfigurable Hardware-Friendly CU-Group Based Merge/Skip Mode for High Efficient Video Coding

Wei Dai, Oscar C. Au, Xing Wen, Wenjing Zhu, Feng Zou, Xingyu Zhang, Vinit Jakhetiya

Department of Electronic and Computer Engineering, HKUST, Clear Water Bay, Kowloon, Hong Kong

{weidai, eeau, wxxab, wzhuaa, fengzou, eexyzhang, vjakhetiya}@ust.hk

Abstract—Merge/skip mode is one of the most important inter prediction tools adopted in the High Efficiency Video Coding (HEVC) standard, the state-of-the-art video coding standard. It is very efficient in reducing the side information for blocks within the same object. However, it is difficult to parallelize in encoding and decoding because of the data dependency between neighboring prediction units (PUs). Furthermore, different shapes and positions of PUs result in different definitions of the merge/skip candidate list (MCL), which potentially adds hardware cost and is not easy to implement efficiently in hardware. To deal with this problem, two reconfigurable hardware-friendly MCL construction schemes are proposed in this paper. The first scheme, called unified MCL (UMCL), uses one candidate list for all PUs inside the motion estimation region (MER), which is regarded as the basic parallel processing unit for the hardware realization. The second scheme, named boundary MCL (BMCL), allows different candidate lists for the PUs on the boundary of the MER. Both schemes offer a flexible degree of parallelism based on the requirement specification. Experimental results show that UMCL reduces the hardware complexity significantly with little coding performance degradation and BMCL achieves significant coding gain while maintaining the hardware complexity.

MMSP’13, Sept. 30 - Oct. 2, 2013, Pula (Sardinia), Italy. 978-1-4799-0125-8/13/$31.00 ©2013 IEEE.

I. INTRODUCTION

Most of the existing video coding standards, including MPEG1/2/4, H.261, H.263 and H.264, are based on a hybrid video coding scheme, which means video is compressed using a hybrid of motion compensation and transform coding. These video coding algorithms compress the video by reducing the redundancies inherent in the raw video data. Since adjacent frames are highly correlated, inter prediction is one of the essential parts of a video coding standard. In H.264, up to 7 different block sizes are used to deal with various content and motion within one macroblock. Moreover, skip mode [1] and direct mode [2] were introduced to handle the situation when the current block and its neighboring blocks belong to the same object and share similar motion properties.

Although the state-of-the-art High Efficiency Video Coding (HEVC) standard still adopts the hybrid video coding structure, it allows a highly flexible hierarchical unit representation. The concept of a macroblock as the basic processing unit in H.264 is generalized to the coding unit (CU) in HEVC [3]. The size of the largest CU (LCU) can be specified as side information in the bit stream. Each CU can be further partitioned into four sub-CUs recursively by a quadtree-based structure. One example of a possible partition of an LCU is shown in Fig. 1. Each CU can be further partitioned into several prediction units (PUs) to further improve the coding performance. Fig. 2 lists all the possible choices of PU partition for one CU, including 4 symmetric mode partition (SMP) types and 4 asymmetric mode partition (AMP) types. Regardless of block size, merge/skip mode [4] is introduced in HEVC to reduce the side information when the current PU and its neighboring PUs are highly correlated.

Fig. 1. Example of one possible LCU partition.

Fig. 2. Choices of PU partition for one CU.

With these new features, the computational complexity of HEVC is much higher than that of H.264. For real-time implementation, a better software design or hardware parallel scheme is needed, which places high demands on hardware-friendly algorithms. Multiple-processor and multi-threaded encoding systems have been used for real-time video encoding [5]. In H.264, the biggest block size, 16x16, was usually regarded as the basic unit for parallel realization. In [6], a single-chip encoder for H.264 with a four-stage macroblock pipeline architecture was presented. Our previous work in [7] proposed to use a unified single motion vector predictor (MVP) in order to do motion estimation with different block sizes in a macroblock at one time. However, because HEVC has so many different block sizes, a flexible block-size parallel approach is more desirable to balance the coding performance and hardware implementation cost.

Since the coding structure of HEVC is very different from H.264, many works based on H.264 cannot be applied directly to HEVC. For example, skip mode and direct mode were only enabled when the block size was 16x16 in H.264, while in HEVC, merge/skip mode is enabled for all possible PU sizes. For merge/skip mode, a merge/skip candidate list (MCL) is constructed first, and the index of the best candidate is encoded and transmitted to the decoder. One property of the merge/skip mode is that the construction of the MCL needs the motion information of neighboring PUs, and this dependency makes the MCL derivation process difficult to parallelize. Several proposals [8], [9], [10], [11], [12], [13], [14], [15], [16] were made to deal with this problem. In particular, the scheme in [12] (called Scheme H0082 for the rest of this paper) was accepted into the current HEVC standard.

In this paper, two flexible hardware-friendly parallel approaches of merge/skip mode are proposed. The rest of this paper is organized as follows. Section II introduces the current merge/skip mode design in HEVC and discusses the difficulties of parallel merge/skip mode. In Section III, the two proposed parallel merge/skip schemes are discussed, and the complexity analysis is given in Section IV. Experimental results are shown in Section V, and Section VI concludes the paper.

II. OVERVIEW OF MERGE/SKIP MODE IN HEVC

A. Traditional Merge/skip mode in HEVC

The current HEVC merge/skip mode simply copies motion parameters, such as reference frame index, prediction direction and motion vector (MV), to the current PU from a candidate list which consists of spatial and temporal neighboring PUs. Fig. 3 illustrates the MCL construction of the merge/skip mode defined in HM4.0. For the MCL of the current PU, a total of five neighboring PUs are involved: four spatial PUs chosen from left (A), above (B), above-right (C), bottom-left (D) or above-left (E), and one temporal collocated PU chosen from right-bottom (F) or center (G). At the encoder side, after constructing the MCL, the candidate with the best performance in the sense of rate-distortion (RD) cost is selected. At the decoder side, the MCL is first constructed, and then the index of the best candidate is parsed from the bitstream.

Fig. 3. Illustration of the MCL construction in HM4.0.

TABLE I
PERCENTAGE OF MERGE/SKIP MODE SELECTION AT QP=32 UNDER LD-LC CONDITION.

Sequence Name    Merge/Skip Mode  Other Modes
BasketballPass   68.6%            31.4%
BasketballDrill  73.9%            26.1%
Vidyo1           89.3%            10.7%
BQTerrace        81.6%            18.4%

Table I shows the percentage of blocks that choose merge/skip mode as the best mode for some sequences under the lowdelay_loco (LD-LC) condition. It can be observed that the percentage of merge/skip mode is very high. For sequences which do not have large or irregular motion, such as Vidyo1 and BQTerrace, more than 80% of the total blocks choose merge/skip mode. Moreover, for fast-motion sequences, such as BasketballPass and BasketballDrill, the percentage of merge/skip mode is still more than 60%. This means the merge/skip mode is very powerful and important among the inter prediction tools. However, because of the properties of MCL construction, the process of traditional merge/skip mode checking is highly sequential and cannot be parallelized; details are discussed below.
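The selection rule of Section II-A (the encoder picks the candidate with the lowest RD cost, the decoder rebuilds the same list and reads the index) can be illustrated with a minimal Python sketch. The names MotionParams, encoder_pick_merge_index and decoder_apply_merge, the direction labels and the toy cost function are hypothetical and are not taken from HM4.0; a real encoder would evaluate an actual RD cost.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class MotionParams:
    """Motion parameters copied by merge/skip mode (fields named in Section II-A)."""
    ref_idx: int         # reference frame index
    direction: str       # prediction direction; the labels used here are assumed
    mv: Tuple[int, int]  # motion vector, assumed to be in quarter-pel units


def encoder_pick_merge_index(mcl: List[MotionParams],
                             rd_cost: Callable[[MotionParams], float]) -> int:
    """Return the index of the candidate with the lowest RD cost."""
    if not mcl:
        raise ValueError("empty merge/skip candidate list")
    return min(range(len(mcl)), key=lambda i: rd_cost(mcl[i]))


def decoder_apply_merge(mcl: List[MotionParams], merge_idx: int) -> MotionParams:
    """Copy the motion parameters of the signalled candidate."""
    if not 0 <= merge_idx < len(mcl):
        raise IndexError("merge index outside the candidate list")
    return mcl[merge_idx]


# Toy example: the cost is the distance of the candidate MV from (8, 0).
mcl = [
    MotionParams(0, "L0", (0, 0)),
    MotionParams(0, "L0", (8, 0)),
    MotionParams(1, "L0", (4, 4)),
]
toy_cost = lambda c: abs(c.mv[0] - 8) + abs(c.mv[1])
best_idx = encoder_pick_merge_index(mcl, toy_cost)  # index 1 in this toy case
decoded = decoder_apply_merge(mcl, best_idx)
print(best_idx, decoded)
```

Only the index is transmitted, which is why the encoder and the decoder must build identical lists; this is also the origin of the dependency discussed in the next subsection.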

B. Problems of the Parallel Merge/skip Mode Design

As described above, due to the spatially varying definition of the MCL, the existing design for merge/skip mode is not hardware-friendly. At the encoder side, only the PUs whose top-left corner coincides with the top-left corner of the current CU (e.g. the PU0s in Fig. 2) can check their merge/skip mode in parallel. The other PU partitions need to wait until their preceding PUs finish their motion estimation process. Moreover, at the decoder side, it is difficult to do motion compensation in parallel under the current design. For example, if one CU contains more than one PU, the motion compensation has to be done sequentially, since the spatially varying definition of the MCL causes data dependency between neighboring PUs, which limits the throughput at the decoder side.

The encoder and decoder throughput of parallel motion estimation in H.264 and HEVC are compared in Fig. 4. For the encoder, in H.264, the skip MV derivation and skip search are at the 16x16 block level and can be fully parallelized with regular motion estimation. In HEVC, coding performance is improved by allowing not only the 16x16 block but all block sizes to have merge/skip mode. Suppose the size of the processing unit is 16x16 and it is divided into four 8x8 PUs; all four PUs can have merge/skip mode. However, due to the dependency of the MCL derivation, in this case only the MCL derivation and merge/skip motion estimation (MME) of the first 8x8 PU (e.g. PU0) can run in parallel with the regular ME. The MCL derivation and MME of the other PUs can only run sequentially after the regular ME is done, which costs additional cycles. Therefore, to exploit the full potential performance of HEVC, the HEVC encoder needs more time to complete the motion estimation process than H.264. For the decoder, in H.264, if the best mode for the macroblock was skip mode, hardware realization was very easy since the processing unit is the same as the block size. However, for HEVC, if the four PUs all use merge/skip mode, the MCL construction has to be done sequentially for each PU. So the current design of merge/skip mode in HEVC greatly limits the throughput at both the encoder and decoder sides.

Fig. 4. Encoder and decoder throughput comparison between H.264 and HEVC at 16x16 block level.

In the original parallel merge/skip architecture (OPMA) of HEVC, the encoder chooses to skip MCL derivation for those PUs whose MCLs cannot be constructed due to the data dependency inside the processing unit (e.g. the additional cycles part in Fig. 4), so the merge/skip mode checking is disabled for those PUs. This architecture causes quality loss but meets the throughput requirement.

Moreover, the MV information from neighboring PUs has quarter-pixel accuracy. So in order to do motion compensation for those candidates, interpolation is also needed for non-integer MVs. For each PU, 5 interpolations are required to check all the candidates in the worst case. Since the MCLs of different PUs are different, interpolation also consumes a lot of computational power in the merge/skip mode checking.

Therefore, in order to design the parallel merge/skip mode, the MCL derivation and MME process should first be decoupled from the regular motion estimation. Secondly, the remaining dependency at the CU and its sub-CUs should also be removed. Thirdly, to enable a flexible trade-off between coding efficiency and throughput on the encoder side, configurable algorithms should be designed. Last but most important, the number of MCL derivation and interpolation operations should be reduced to lower the computational burden and make the data flow regular.

Based on the analysis above, two configurable hardware-friendly schemes are proposed to deal with these problems. The first scheme, called unified MCL, constructs one MCL for all the PUs inside the motion estimation region (MER), which is the basic parallel processing unit. This modification can reduce the number of MCL derivation and interpolation operations significantly with affordable performance loss. The second scheme, named boundary MCL, allows the PUs which are on the left or top boundary of the MER to have their own MCLs to improve coding performance while maintaining the parallel properties. This scheme can achieve significant coding gain while keeping similar complexity compared to Scheme H0082. Details of the proposed schemes are discussed in the next section.
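The dependency described in Section II-B for a 16x16 processing unit split into four 8x8 PUs can be checked with a small Python sketch. The sample positions used for candidates A to E are an assumption (left, above, above-right, bottom-left and above-left, measured from the PU corners) and may differ in detail from HM4.0; the function names are hypothetical.

```python
def spatial_neighbours(x, y, w, h):
    """Assumed sample positions of the five spatial candidates of a PU."""
    return {
        "A": (x - 1, y + h - 1),  # left
        "B": (x + w - 1, y - 1),  # above
        "C": (x + w, y - 1),      # above-right
        "D": (x - 1, y + h),      # bottom-left
        "E": (x - 1, y - 1),      # above-left
    }


def inside(pos, region):
    """True if sample position pos lies inside region = (x, y, w, h)."""
    px, py = pos
    rx, ry, rw, rh = region
    return rx <= px < rx + rw and ry <= py < ry + rh


def independent_pus(pus, unit):
    """Names of PUs whose spatial candidates all lie outside the processing unit."""
    return [
        name
        for name, (x, y, w, h) in pus.items()
        if not any(inside(p, unit) for p in spatial_neighbours(x, y, w, h).values())
    ]


# One 16x16 processing unit divided into four 8x8 PUs, as in the text.
unit = (0, 0, 16, 16)
pus = {
    "PU0": (0, 0, 8, 8),
    "PU1": (8, 0, 8, 8),
    "PU2": (0, 8, 8, 8),
    "PU3": (8, 8, 8, 8),
}
parallel = independent_pus(pus, unit)
print(parallel)  # ['PU0']
```

Under these assumptions only PU0 has no candidate inside the unit, which matches the observation that only the first 8x8 PU can run in parallel with the regular ME. The sketch ignores whether a position outside the unit has already been coded.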

III. PROPOSED PARALLEL SCHEMES

In order to make the MCL construction process hardware-friendly, two parallel schemes are proposed in this section. Since a flexible block-size parallel approach is preferred, a parallel processing level (PPL) is defined and transmitted to the decoder to indicate which block size is used as the MER. Based on the value of PPL, an LCU is divided into a number of non-overlapping MERs of equal size and square shape. Each MER can contain multiple CUs, and these CUs are processed at one time by the hardware. The number and size of MERs for different PPL values are listed in Table II when the LCU is 64x64. Note that when PPL = 4, the proposed methods become the same as the current merge/skip mode in the reference software HM4.0.

Fig. 5. Illustration of the MCL construction for UMCL scheme.
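The relation between PPL and the MER layout in Table II amounts to halving the MER side for each PPL step. A minimal Python sketch follows; the function names mer_layout and mer_origin and the 4x4 lower bound parameter are illustrative assumptions, not syntax from the paper.

```python
def mer_layout(ppl, lcu_size=64, min_size=4):
    """Return (number of MERs per LCU, MER side length) for a given PPL."""
    if ppl < 0:
        raise ValueError("PPL must be non-negative")
    mer_size = lcu_size >> ppl  # each PPL step halves the MER side
    if mer_size < min_size:
        raise ValueError("PPL too large for this LCU size")
    per_side = lcu_size // mer_size
    return per_side * per_side, mer_size


def mer_origin(x, y, mer_size):
    """Top-left corner of the MER that contains sample position (x, y)."""
    return (x // mer_size) * mer_size, (y // mer_size) * mer_size


for ppl in range(5):
    count, size = mer_layout(ppl)
    print(ppl, count, f"{size}x{size}")
```

For a 64x64 LCU the loop lists the five rows of Table II, from one 64x64 MER at PPL = 0 to 256 MERs of 4x4 at PPL = 4.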

A. Scheme I: Unified Merge/skip Candidate List (UMCL)

If we consider each MER as the basic parallel processing unit, then in the MCL construction process, the candidates within the same MER are not available because the neighboring PUs are not encoded yet. In order to remove the dependency between neighboring PUs inside the MER, in this scheme all the PUs inside the MER use the same MCL, which is the list obtained when the current MER is regarded as one PU. Thus, before one MER is processed in parallel, the MCL has already been constructed, and all the PUs inside it use the same list to do the merge/skip mode checking. In this way, parallel processing of the merge/skip mode for all the PUs inside one MER is achieved. Fig. 5 gives one example of UMCL construction: three PUs, PU0, PU1 and PU2, use the same MCL, which is the MCL of the MER. The 4 spatial candidates are chosen from A, B, C, D and E. Taking PU2 as an example, since all the PUs inside the MER are processed at the same time, the original spatial candidates of PU2, which are marked in dark gray in Fig. 5, are all unavailable, so in UMCL, PU2 shares the same MCL with the other PUs inside the MER.
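A minimal Python sketch of the UMCL idea: every PU is mapped to its enclosing MER, and the spatial candidate positions are derived as if the MER were a single PU. The candidate sample positions and the function names are assumptions made for illustration, and the PU layout below is a made-up example rather than the one in Fig. 5.

```python
def spatial_neighbours(x, y, w, h):
    """Assumed sample positions of the five spatial candidates of a block."""
    return {
        "A": (x - 1, y + h - 1),  # left
        "B": (x + w - 1, y - 1),  # above
        "C": (x + w, y - 1),      # above-right
        "D": (x - 1, y + h),      # bottom-left
        "E": (x - 1, y - 1),      # above-left
    }


def umcl_candidate_positions(pu, mer_size):
    """Spatial candidate positions shared by all PUs of the MER containing pu."""
    x, y, _, _ = pu
    mer_x = (x // mer_size) * mer_size
    mer_y = (y // mer_size) * mer_size
    # The MER itself is treated as one PU of size mer_size x mer_size.
    return spatial_neighbours(mer_x, mer_y, mer_size, mer_size)


# Three PUs inside the 16x16 MER whose top-left corner is (16, 16).
pu0 = (16, 16, 8, 8)
pu1 = (24, 16, 8, 8)
pu2 = (16, 24, 16, 8)
shared = [umcl_candidate_positions(pu, 16) for pu in (pu0, pu1, pu2)]
print(shared[0])
```

All three calls return the same positions, so the list only has to be derived once per MER, and none of the positions depends on a PU inside the MER.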

B. Scheme II: Boundary Merge/skip Candidate List (BMCL)

Fig. 6. Illustration of the MCL construction for BMCL scheme.

TABLE II
MER NUMBER AND SIZE FOR DIFFERENT PPL WHEN LCU SIZE IS 64X64.

PPL  MER number  MER size
0    1           64x64
1    4           32x32
2    16          16x16
3    64          8x8
4    256         4x4 (Ref. software)

UMCL is the most hardware-friendly scheme because it only needs to construct the MCL once for all PUs inside the MER. But for PUs which are on the top or left boundary of the MER, some of their spatial candidates are actually available, and these candidates are surely better than the candidates in UMCL. Also, these candidates can be used to construct their MCLs without causing any dependency problem. So in this scheme, for those boundary PUs, if one of a PU's candidates is outside the MER, then this candidate replaces the corresponding candidate in the UMCL. For the temporal merge/skip candidate, each PU uses its own temporal merge/skip candidate. Fig. 6 illustrates the MCL construction for PUs inside one MER. For PU0 in Fig. 6, since all its spatial candidates are available, the spatial candidates are chosen from the blocks F, G, H, I and E. For PU1, the top-left, top and top-right candidates G, J and K are available while the left and bottom-left blocks are not, so blocks G, J and K are used to replace the candidates in the UMCL to construct the MCL for PU1. PU2 uses the UMCL since none of its neighboring PUs is available.
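The BMCL replacement rule can be sketched in Python as follows. Two assumptions are made loudly here: the candidate sample positions are illustrative, and a candidate is treated as "outside the MER" only when it lies above or to the left of the MER, since positions to the right of or below the MER are not yet coded. The temporal candidate, which each PU derives on its own, is not modelled, and the PU layout is a made-up example rather than the one in Fig. 6.

```python
def spatial_neighbours(x, y, w, h):
    """Assumed sample positions of the five spatial candidates of a block."""
    return {
        "A": (x - 1, y + h - 1),  # left
        "B": (x + w - 1, y - 1),  # above
        "C": (x + w, y - 1),      # above-right
        "D": (x - 1, y + h),      # bottom-left
        "E": (x - 1, y - 1),      # above-left
    }


def bmcl_candidate_positions(pu, mer_size):
    """Per-PU spatial candidates: own position if usable, else the UMCL one."""
    x, y, w, h = pu
    mer_x = (x // mer_size) * mer_size
    mer_y = (y // mer_size) * mer_size
    shared = spatial_neighbours(mer_x, mer_y, mer_size, mer_size)  # UMCL
    own = spatial_neighbours(x, y, w, h)
    result = {}
    for label, (px, py) in own.items():
        # Assumed availability test: above or to the left of the MER.
        usable = px < mer_x or py < mer_y
        result[label] = (px, py) if usable else shared[label]
    return result


# 16x16 MER at (16, 16) split into 8x8 PUs.
pu_top_left = (16, 16, 8, 8)      # on both the top and left boundary
pu_top_right = (24, 16, 8, 8)     # on the top boundary only
pu_bottom_right = (24, 24, 8, 8)  # not on the top or left boundary
for pu in (pu_top_left, pu_top_right, pu_bottom_right):
    print(pu, bmcl_candidate_positions(pu, 16))
```

In this example the top-left PU keeps all of its own candidates, the top-right PU keeps its above-left, above and above-right candidates and takes the left and bottom-left ones from the UMCL, and the bottom-right PU falls back to the UMCL entirely, mirroring the three cases described for PU0, PU1 and PU2.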

IV. COMPLEXITY ANALYSIS OF THE PROPOSED MERGE/SKIP PARALLEL SCHEME

To analyse the complexity of the proposed schemes, we assume the worst-case situation at the decoder side, where one MER is subdivided into PUs of 4x4 size. In HEVC, each 4x4 PU has its own candidate list with up to 5 candidates. In the worst case, one LCU needs to access memory 256x5 times for reference pixel data during motion compensation. With UMCL, using only one MCL reduces the number of memory accesses significantly. Also, at the decoder side, this scheme simplifies the logic to derive the MCL for each PU since it only requires one derivation for all PUs inside the MER. For BMCL, the candidate lists of the boundary 4x4 PUs differ from the UMCL, since some of their candidates can be derived without causing the dependency problem, while the other PUs, which are not on the top or left boundary of the MER, still use the UMCL for merge/skip mode checking. Moreover, the complexity of Scheme H0082, which was accepted into the HEVC standard, is also analyzed. For each PU in the MER, if one of the candidates is inside the MER, Scheme H0082 simply disables this candidate and uses the candidates which are outside the MER for merge/skip mode checking to avoid the data dependency problem. So for the PUs which are not on the block boundary, Scheme H0082 uses only the temporal candidate for the merge/skip mode checking. The complexity analysis of the anchor HM4.0, UMCL, BMCL, OPMA and Scheme H0082 is given in Table III.

TABLE III
COMPLEXITY ANALYSIS OF ANCHOR HM4.0, UMCL, BMCL, OPMA AND SCHEME H0082.

Method         MER size  Number   MV memory access           Reference pixel memory  Total interpolation  Complexity
                         of MERs  for each MER               access for each MER     operations           reduction
Anchor HM4.0   -         -        (16x2+2+255)+256=545 (a)   256x5=1280              1280                 -
UMCL           64x64     1        5                          5                       5x1=5                99.6%
UMCL           32x32     4        5                          5                       5x4=20               98.4%
UMCL           16x16     16       5                          5                       5x16=80              93.8%
UMCL           8x8       64       5                          5                       5x64=320             75%
BMCL           64x64     1        (16x2+2)+256=290           4+4x31+256=384          384x1=384            70%
BMCL           32x32     4        (8x2+3)+64=83              4+4x15+64=128           128x4=512            60%
BMCL           16x16     16       (4x2+3)+16=27              4+4x7+16=48             48x16=768            40%
BMCL           8x8       64       (2x2+3)+4=11               4+4x3+4=20              20x64=1280           0%
OPMA           64x64     1        5 (b)                      5                       5x1=5                99.6%
OPMA           32x32     4        5                          5                       5x4=20               98.4%
OPMA           16x16     16       5                          5                       5x16=80              93.8%
OPMA           8x8       64       5                          5                       5x64=320             75%
Scheme H0082   64x64     1        (16x2+2)+256=290           4+4x31+256=384          384x1=384            70%
Scheme H0082   32x32     4        (8x2+3)+64=83              4+4x15+64=128           128x4=512            60%
Scheme H0082   16x16     16       (4x2+3)+16=27              4+4x7+16=48             48x16=768            40%
Scheme H0082   8x8       64       (2x2+3)+4=11               4+4x3+4=20              20x64=1280           0%

(a) In the anchor HM4.0, only the bottom-right 4x4 block will not be used as a spatial candidate, so an additional 255 blocks are added into the MV access calculation. Also, the bottom-left block is not coded yet, so only 2 corner blocks (above-left and above-right) and 32 boundary blocks can be considered as spatial neighboring MVP candidates.

(b) For the OPMA method, only the top-left 4x4 PU will be checked.

Several conclusions can be drawn from Table III. For UMCL, because there is only one MCL, the hardware only needs to construct the MCL once and perform the interpolation operation five times for each MER, which greatly reduces the complexity and computational burden for the hardware design. For BMCL, although the complexity is quite similar to that of Scheme H0082, it allows PUs to test more merge/skip candidates, which potentially introduces some coding gain compared to Scheme H0082.
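The interpolation counts and complexity reductions in Table III can be reproduced with a short Python sketch. The closed form used for the BMCL reference pixel accesses, 4 + 4(2n - 1) + n^2 with n the number of 4x4 PUs along one MER side, is inferred from the four table entries (31, 15, 7 and 3 equal 2n - 1 for n = 16, 8, 4 and 2) and is an assumption rather than a formula stated in the paper; the function names are hypothetical.

```python
LCU_SIZE = 64
ANCHOR_INTERP = 256 * 5  # 256 4x4 PUs per LCU, up to 5 candidates each


def num_mers(mer_size, lcu_size=LCU_SIZE):
    """Number of MERs in one LCU."""
    return (lcu_size // mer_size) ** 2


def umcl_total_interp(mer_size):
    """UMCL: one list of 5 candidates per MER."""
    return 5 * num_mers(mer_size)


def bmcl_ref_access_per_mer(mer_size):
    """BMCL reference pixel accesses per MER (pattern inferred from Table III)."""
    n = mer_size // 4  # 4x4 PUs along one side of the MER
    return 4 + 4 * (2 * n - 1) + n * n


def bmcl_total_interp(mer_size):
    """BMCL: per-MER accesses times the number of MERs."""
    return bmcl_ref_access_per_mer(mer_size) * num_mers(mer_size)


def reduction_percent(total):
    """Complexity reduction relative to the HM4.0 anchor."""
    return 100.0 * (ANCHOR_INTERP - total) / ANCHOR_INTERP


for size in (64, 32, 16, 8):
    u, b = umcl_total_interp(size), bmcl_total_interp(size)
    print(f"{size}x{size}: UMCL {u} ({reduction_percent(u):.1f}%), "
          f"BMCL {b} ({reduction_percent(b):.1f}%)")
```

The values agree with the "Total interpolation operations" and "Complexity reduction" columns of Table III up to rounding.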

V. EXPERIMENTAL RESULTS

The proposed algorithm is implemented in the HEVC reference software HM4.0. The simulations follow the common testing conditions specified in [17]; four conditions are used, namely randomaccess (RA-HE), randomaccess_loco (RA-LC), lowdelay (LD-HE) and lowdelay_loco (LD-LC). BD-rate [18] is measured to evaluate the performance of the proposed algorithm. When the MER is small, typically smaller than 16x16, it is neither efficient nor practical to use it for parallel processing, so in this paper MER sizes from 64x64 down to 16x16 are tested. The proposed schemes are compared with the anchor HM4.0, the OPMA scheme described in Section II-B and Scheme H0082. Experimental results are shown in Tables IV, V and VI, respectively.

TABLE IV
SIMULATION RESULTS OF PROPOSED UMCL AND BMCL SCHEMES COMPARED TO HM4.0 ANCHOR.

Method    UMCL                    BMCL
MER size  64x64   32x32   16x16   64x64   32x32   16x16
RA-HE     2.9%    1.7%    0.7%    1.7%    0.9%    0.2%
RA-LC     3.0%    1.7%    0.7%    1.8%    0.9%    0.3%
LD-HE     4.4%    2.8%    1.2%    2.6%    1.3%    0.3%
LD-LC     4.9%    2.8%    1.1%    2.9%    1.4%    0.4%

TABLE V
SIMULATION RESULTS OF PROPOSED UMCL AND BMCL SCHEMES COMPARED TO OPMA.

Method    UMCL                    BMCL
MER size  64x64   32x32   16x16   64x64   32x32   16x16
RA-HE     -2.7%   -3.1%   -1.7%   -4.8%   -3.9%   -2.1%
RA-LC     -3.9%   -3.4%   -1.6%   -5.4%   -4.1%   -2.0%
LD-HE     -3.4%   -3.7%   -2.1%   -5.1%   -5.1%   -2.9%
LD-LC     -5.3%   -4.6%   -3.1%   -7.1%   -5.9%   -2.9%

TABLE VI
SIMULATION RESULTS OF PROPOSED UMCL AND BMCL SCHEMES COMPARED TO SCHEME H0082.

Method    UMCL                    BMCL
MER size  64x64   32x32   16x16   64x64   32x32   16x16
RA-HE     0.4%    0.3%    0.2%    -0.8%   -0.6%   -0.2%
RA-LC     0.2%    0.1%    0.1%    -1.0%   -0.7%   -0.3%
LD-HE     1.0%    0.8%    0.5%    -0.9%   -0.7%   -0.3%
LD-LC     0.7%    0.5%    0.4%    -1.3%   -0.9%   -0.3%

From the comparison of the proposed schemes with the HM4.0 anchor (Table IV), it can be observed that for the UMCL scheme, when the MER size is 64x64, the performance loss is relatively significant (a BD-rate increase of 2.9% to 4.9%), which means the parallelism of merge/skip mode at the LCU level is not very satisfactory for the UMCL scheme. For the other cases, however, the performance loss becomes affordable (at most 2.8%), and the flexible parallel approach of merge/skip mode is achieved.

When comparing the performance of the proposed methods with the OPMA scheme (Table V), we can see that both UMCL and BMCL outperform the OPMA scheme, especially for large MER sizes. This is because when the MER is large, only the top-left PU can test the merge/skip mode in the OPMA scheme, while in UMCL and BMCL all the PUs can still try the merge/skip mode. So when the MER size is large, both proposed schemes result in better performance.

As for Scheme H0082 (Table VI), although the performance of UMCL is a little worse than that of Scheme H0082 (a BD-rate increase of 0.1% to 1.0%), the hardware complexity of UMCL is greatly reduced compared to Scheme H0082 according to Table III. Moreover, the performance of the BMCL scheme is better than that of Scheme H0082 (a BD-rate reduction of 0.2% to 1.3%) while maintaining the same hardware complexity.

VI. CONCLUSION

In this paper, two schemes of parallel merge/skip mode are proposed. The first scheme, UMCL, allows only one MCL for all the PUs inside the MER, while the second scheme, BMCL, allows different MCLs for the PUs on the top or left boundary of the MER. By using these two schemes, the parallelism of the merge/skip mode for all PUs inside one MER is achieved. Moreover, the proposed methods are hardware-friendly and offer a flexible degree of parallelism by defining different parallel processing levels. Compared to the method which was adopted in the HEVC standard, experimental results show that UMCL reduces the hardware complexity significantly with little coding performance loss and BMCL achieves significant coding gain while maintaining the hardware complexity.

ACKNOWLEDGEMENT

This work has been supported in part by the Research Grants Council (GRF Project no. 610112) and HKUST (HKUST Project no. FSGRF12EG01) of the Hong Kong Special Administrative Region, China.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 560–576, 2003.

[2] A. M. Tourapis, F. Wu, and S. Li, “Direct mode coding for bipredictive slices in the H.264 standard,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 15, no. 1, pp. 119–126, 2005.

[3] D. Marpe, H. Schwarz, S. Bosse, B. Bross, P. Helle, T. Hinz, H. Kirchhoffer, H. Lakshman, T. Nguyen, S. Oudin et al., “Video compression using nested quadtree structures, leaf merging, and improved techniques for motion representation and entropy coding,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 20, no. 12, pp. 1676–1687, 2010.

[4] Y. W. Huang, B. Bross, M. Zhou, W. J. Chien, and I. K. Kim, “Description of core experiment 9: MV coding and skip/merge operations,” Document of Joint Collaborative Team on Video Coding, JCTVC-E709, March 2011.

[5] Y. K. Chen, X. Tian, S. Ge, and M. Girkar, “Towards efficient multi-level threading of H.264 encoder on Intel hyper-threading architectures,” in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 2004, p. 63.

[6] Y. W. Huang, T. C. Chen, C. H. Tsai, C. Y. Chen, T. W. Chen, C. S. Chen, C. F. Shen, S. Y. Ma, T. C. Wang, B. Y. Hsieh et al., “A 1.3 TOPS H.264/AVC single-chip encoder for HDTV applications,” in Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International. IEEE, 2005, pp. 128–588.

[7] X. Wen, O. C. Au, J. Xu, L. Fang, R. Cha, and J. Li, “Novel RD-optimized VBSME with matching highly data re-usable hardware architecture,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 21, no. 2, pp. 206–219, 2011.

[8] M. Zhou, “Parallelized merge/skip mode for HEVC,” Document of Joint Collaborative Team on Video Coding, JCTVC-F069, July 2011.

[9] Y. Jeon, S. Park, J. Park, and B. Jeon, “Non-CE9: improvement on parallelized merge/skip mode,” Document of Joint Collaborative Team on Video Coding, JCTVC-G164, November 2011.

[10] X. Wen, O. C. Au, W. Dai, C. Pang, F. Zou, J. Dai, and X. Zhang, “Description of core experiment 9: MV coding and skip/merge operations,” Document of Joint Collaborative Team on Video Coding, JCTVC-G387, November 2011.

[11] M. Zhou, H. Y. Kim, P. Onno, and X. Wen, “JCT-VC AHG report: Parallel merge/skip (AHG 10),” Document of Joint Collaborative Team on Video Coding, JCTVC-H0010, February 2012.

[12] M. Zhou, “AHG10: Configurable and CU-group level parallel merge/skip,” Document of Joint Collaborative Team on Video Coding, JCTVC-H0082, February 2012.

[13] Y. Jeon, B. Jeon, M. Zhou, W. Wen, O. C. Au, and H. Y. Kim, “AHG10: Unified design on parallel merge/skip,” Document of Joint Collaborative Team on Video Coding, JCTVC-H0090, February 2012.

[14] Y. Jeon, B. Jeon, V. Seregin, X. Wang, J. Chen, and M. Karczewicz, “Parallel merge candidate derivation for Inter_NxN partition type,” Document of Joint Collaborative Team on Video Coding, JCTVC-H0091, February 2012.

[15] ——, “Non-CE9: Removing PU dependency in TMVP reference index derivation,” Document of Joint Collaborative Team on Video Coding, JCTVC-H0092, February 2012.

[16] H. Y. Kim, Y. Jeon, B. Jeon, M. Zhou, X. Wen, and O. C. Au, “AHG10: Unified design on parallel merge/skip with reduced candidates,” Document of Joint Collaborative Team on Video Coding, JCTVC-H0247, February 2012.

[17] F. Bossen, “Common test conditions and software reference configurations,” Document of Joint Collaborative Team on Video Coding, JCTVC-F900, July 2011.

[18] G. Bjøntegaard, “Improvements of the BD-PSNR model,” ITU-T SG16 Q.6 Document, VCEG-AI11, July 2008.
