Implementation And Improvement Of Wavefront Parallel Processing For HEVC Encoding On Many-core Platform

Shaobo Zhang, Xiaoyun Zhang, Zhiyong Gao

2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)

2

Outline

• Introduction

• Proposed Method

• Experimental Results

• Conclusion

3

Introduction

• HEVC introduces two parallelization tools, Tiles and Wavefront Parallel Processing (WPP), to facilitate high-level parallel processing.

• Unlike slices and Tiles, WPP neither changes the regular raster scan order nor breaks coding dependencies at row boundaries.

• WPP therefore often provides better compression performance and avoids some visual artifacts that Tile and slice parallelism may induce.

4

Introduction(Cont.)

• Several related works focus on improving the parallelism of HEVC.

• Chi[4] presents an approach called Overlapped Wavefront (OWF) to enhance the parallel efficiency of WPP.

• Yan[5] utilizes the data dependencies among neighboring CTUs and PU regions to exploit the implicit parallelism.

• [4] C. C. Chi et al., “Parallel scalability and efficiency of HEVC parallelization approaches,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1827–1838, Dec. 2012.

• [5] C. Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” Proc. DCC, pp. 63–72, Mar. 2013.

5

Introduction(Cont.)

• WPP and its existing implementations still have some shortcomings.

– The HEVC test model (HM) is a single-core codec, so its serial realization of WPP is not suitable for HEVC encoding on a many-core platform.

– The wavefront dependencies introduce parallelization inefficiencies, which become worse when a large number of processors is used.

6

Proposed Method

• Except for the first row of a slice, WPP requires control signaling that tells each CTU, before it is processed, whether the top-right CTU in the previous row has been encoded.

• Additional memory is required to store the side information and CABAC probabilities needed by the following rows.

7

Proposed Method(Cont.)

• A try-and-wait mechanism is presented to apply WPP to an HEVC encoder on a many-core platform.

– The control signals are stored CTU by CTU, so W × H bytes are required.

– Before it is processed, the current CTU checks whether the top-right CTU in the previous row is done. If not, the corresponding core waits and tries again.
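
The try-and-wait idea can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: one thread per CTU row stands in for one core, `encode_ctu` is a hypothetical callback, and the last CTU of a row (which has no top-right neighbor) is assumed to wait for the CTU directly above it.

```python
import threading
import time

def encode_frame_wpp(width, height, encode_ctu):
    """Encode a width x height grid of CTUs, one thread per CTU row."""
    # One byte of control signaling per CTU: W x H bytes in total.
    done = [bytearray(width) for _ in range(height)]

    def encode_row(y):
        for x in range(width):
            if y > 0:
                # Top-right CTU of the previous row; clamped for the last
                # CTU of a row, which has no top-right neighbor (assumption).
                dep = min(x + 1, width - 1)
                while not done[y - 1][dep]:
                    time.sleep(0)  # not ready yet: wait, then try again
            encode_ctu(x, y)  # hypothetical per-CTU encoding callback
            done[y][x] = 1  # signal completion to the row below

    threads = [threading.Thread(target=encode_row, args=(y,))
               for y in range(height)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

A real encoder on a many-core chip would pin rows to cores and use the platform's own synchronization primitives; the polling loop here only mirrors the "check, wait, attempt again" behavior described above.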

8

• Ping-pong storage is utilized to reduce the memory needed for side information storage.

9

• A data reuse structure is also utilized for storing the CABAC probabilities.

– Once the probabilities of the previous row have been used, they are no longer needed, so they can be overwritten by the newest probabilities. The data reuse structure reduces the probability storage by 88%.

• Based on the above methods, WPP is realized efficiently for a real-time HEVC encoder on a many-core platform.

10

Proposed Method(Cont.)

• Parallel scalability model of WPP

– When adding cores no longer increases the encoding speed, the encoder has reached its Maximum Parallel Scalability (MPS).

• k : number of cores.

• n : number of CTU units (rows, Tiles or slices) in one frame.
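
The saturation behind MPS can be observed with a small simulation. The sketch below is not the scalability model from the paper; it assumes an idealized encoder in which every CTU takes exactly one time step, free cores always take the topmost ready rows, and the frame is `width` × `height` CTUs (all names are illustrative).

```python
def wpp_time_steps(width, height, cores):
    """Time steps to encode one frame when every CTU takes one step."""
    if cores < 1:
        raise ValueError("at least one core is required")
    done = [0] * height  # CTUs finished so far in each row
    steps = 0
    while done[-1] < width:
        snapshot = list(done)  # state at the start of this step
        busy = 0
        for y in range(height):
            if busy == cores:
                break  # every core already has a CTU for this step
            x = snapshot[y]  # next CTU to encode in row y
            if x == width:
                continue  # row already finished
            # Row y needs the top-right CTU of row y-1 (clamped at row end).
            if y == 0 or snapshot[y - 1] >= min(x + 2, width):
                done[y] += 1
                busy += 1
        steps += 1
    return steps
```

For a 6 × 4 CTU grid, `wpp_time_steps(6, 4, 1)` gives 24 steps, while 3 cores and 36 cores both give 12: beyond 3 cores the wavefront dependencies leave the extra cores idle.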

11

Proposed Method(Cont.)

• α : remaining rows.

• u = ceil(H/k)

• v = (H−1) mod k
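
As a quick numeric check of the two terms defined above, the following sketch evaluates u and v. It assumes H is the number of CTU rows in a frame, which the slides do not state explicitly; the function name is illustrative.

```python
def wpp_row_terms(H, k):
    """Return (u, v) with u = ceil(H/k) and v = (H-1) mod k."""
    if H < 1 or k < 1:
        raise ValueError("H and k must be positive")
    u = -(-H // k)  # integer ceiling of H / k
    v = (H - 1) % k
    return u, v
```

With H = 17 rows and k = 8 cores this gives u = 3 and v = 0.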

12

Proposed Method(Cont.)

• Improvement of parallel scalability for WPP

– Reduce CTU size

– Combine WPP with slice-level parallelism

– Combine WPP with frame-level parallelism

13

Proposed Method(Cont.)

• Reduce CTU size

– Reducing the CTU size is an efficient way to increase the number of CTU rows (the frame height measured in CTUs) and thereby improve the parallel scalability.
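
The effect of the CTU size on the number of CTU rows available to WPP is simple arithmetic. The sketch below assumes a 1080-line frame purely as an illustration; the resolution is not taken from the slides.

```python
def ctu_rows(frame_height, ctu_size):
    """Number of CTU rows covering a frame of frame_height luma lines."""
    return -(-frame_height // ctu_size)  # integer ceiling

# Illustrative 1080-line frame (assumed resolution).
for size in (64, 32, 16):
    print(size, ctu_rows(1080, size))
```

Halving the CTU size roughly doubles the number of rows (17, 34 and 68 rows here), and with it the number of rows that can be in flight at once.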

14

Proposed Method(Cont.)

– Although reducing the CTU size effectively increases the parallel scalability of WPP, it decreases the coding efficiency.

– Kim[6] shows a BD-rate performance loss of about 3.4% to 14.4% when the CTU size decreases from 32 × 32 to 16 × 16.

– A CTU size of 32 × 32 is therefore preferable to balance parallelism against performance loss.

• [6] Kim et al., “Block partitioning structure in the HEVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1649–1668, Dec. 2012.

15

Proposed Method(Cont.)

• Combine WPP with slice-level parallelism

– Slice-level parallelism, such as slices and Tiles, can break some dependencies among rows, so the parallel scalability is enhanced when it is combined with WPP.

– Clare[7] implements two types of Tile and WPP combination, which divide the frame into two independent or dependent Tiles side by side, each of which is wavefront processed.

• [7] G. Clare et al., “Wavefront parallel processing for HEVC encoding and decoding,” JCTVC-F0274, July 2011.

16

Proposed Method(Cont.)

– Combining 2 to 4 slices with WPP at a 32 × 32 CTU size brings promising parallel scalability with only minor performance loss.

• m : number of slices or Tiles.

• Hm = H/m.

• v' = (Hm−1) mod [floor(k/m)]
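
The slice-level terms can be evaluated the same way. This sketch assumes H is the number of CTU rows, that m divides H evenly, and that at least one core is available per slice (k ≥ m); none of these conditions is stated in the slides, and the function name is illustrative.

```python
def wpp_slice_terms(H, k, m):
    """Return (Hm, v') with Hm = H/m and v' = (Hm-1) mod floor(k/m)."""
    if m < 1 or H % m != 0:
        raise ValueError("assumes m divides H evenly")
    if k < m:
        raise ValueError("assumes at least one core per slice")
    Hm = H // m  # CTU rows per slice or Tile
    v_prime = (Hm - 1) % (k // m)
    return Hm, v_prime
```

For example, H = 32 rows split into m = 4 slices on k = 36 cores gives Hm = 8 and v' = 7.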

17

Proposed Method(Cont.)

18

Proposed Method(Cont.)

• Combine WPP with frame-level parallelism

– Two GOP structures, IPpP and IPpp, are introduced to improve parallelism, where I and P frames can be used as reference frames, while p (denoting a disposable frame) cannot be used as a reference.

– When a row has been encoded and no more tasks are available in the current picture, WPP combined with frame-level parallelism starts the next 1 to 3 frames simultaneously.

19

Proposed Method(Cont.)

– It can be inferred that H −2 cores are enough for the parallel encoding.

– The start time can be deduced as NW + 2Nr + 1.

– The finish moment of the Nth picture can be deduced as (N + 2)W + 2Nr + 2.

• r : maximum vertical search range.

• N : index of the picture (the Nth picture).
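
The two timing expressions can be evaluated directly. The sketch below assumes that W is the frame width in CTUs, that r is expressed in CTU rows, and that times are counted in units of one CTU encoding time; the slides do not state these units, and the function names are illustrative.

```python
def start_time(N, W, r):
    """Start time of the Nth picture: N*W + 2*N*r + 1."""
    return N * W + 2 * N * r + 1

def finish_time(N, W, r):
    """Finish moment of the Nth picture: (N + 2)*W + 2*N*r + 2."""
    return (N + 2) * W + 2 * N * r + 2
```

With W = 30, r = 1 and N = 3, the picture starts at 97 and finishes at 158; under these formulas the gap between start and finish is always 2W + 1.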

20

Proposed Method(Cont.)

– The finishing moment of the Nth frame is (α + 2)W + 2αr + 2.

– (p+1)(H −r) cores are enough to attain its MPS.

• r : maximum vertical search range.

• p : number of disposable frames.

• α = ceil[ N/(p+1) ].
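
A short sketch of these expressions, assuming W and H are the frame width and height in CTUs and r is given in CTU rows (units the slides leave implicit); the function names are illustrative.

```python
def alpha(N, p):
    """alpha = ceil(N / (p + 1))."""
    return -(-N // (p + 1))  # integer ceiling

def finish_time_with_disposable(N, W, r, p):
    """Finishing moment of the Nth frame: (alpha + 2)*W + 2*alpha*r + 2."""
    a = alpha(N, p)
    return (a + 2) * W + 2 * a * r + 2

def cores_for_mps(H, r, p):
    """Cores sufficient to attain MPS: (p + 1) * (H - r)."""
    return (p + 1) * (H - r)
```

With p = 0 disposable frames, α equals N and the finishing moment reduces to (N + 2)W + 2Nr + 2, the expression given for the case without disposable frames. With p = 2, W = 30, H = 17 and r = 1, the third frame finishes at 94 and 48 cores are sufficient.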

21

Experimental Results

• Test sequences and encoding environment

– The encoder, named FHM10.0, is migrated from the HEVC reference software HM10.0.

– The input videos are standard test sequences of 100 frames each, and the motion search range is set to 64.

– The Main profile is selected, with the default encoding test conditions specified in [8].

– The experimental platform is the GX36, a member of the TILERA many-core processor family that contains 36 processing cores.

• [8] F. Bossen, “Common test conditions and software reference configurations,” JCTVC-I1100, Apr. 2012.

22

Experimental Results

• Parallel scalability analysis

23

24

25

Conclusion

• Several effective methods, such as a try-and-wait data interface, ping-pong storage and a data reuse structure, are presented to realize WPP in parallel on an HEVC encoder.

• Three effective methods are presented to improve the parallel scalability of WPP.

• Experimental results show that the proposed methods improve the maximum parallel scalability by more than 40% compared with WPP.
