Data Partition for Wavefront Parallelization of H.264 Video Encoder. Zhuo Zhao, Ping Liang. IEEE ISCAS 2006


Page 1: Data Partition for Wavefront Parallelization of H.264 Video Encoder

Zhuo Zhao, Ping Liang

IEEE ISCAS 2006

Page 2: Outline

- Introduction
- Data Dependencies in H.264
- Data Partition and Task Priority
- Experimental Results
- Conclusions

Page 3: Introduction - Background Knowledge (1/7)

- Video compression technologies
  - Spatial redundancy
  - Temporal redundancy
- H.264/AVC new features
  - Quarter-pel ME, variable block sizes, multiple reference frames, intra-prediction, CAVLC, CABAC, in-loop deblocking filter, etc.

Page 4: Introduction - Background Knowledge (2/7)

- In [1], compared with the MPEG-4 Simple Profile:
  - Up to 50% bitrate reduction is achieved at the cost of more than four times the computation.
  - Lower bitrate, higher computational complexity.
- Hardware and software acceleration for real-time applications.

Page 5: Introduction - Background Knowledge (3/7)

- In [2], a single-chip encoder for H.264 using a four-stage macroblock pipeline architecture is presented.
  - A satisfactory R-D tradeoff is reported.
  - The coding mode of the current MB is found by approximation from neighboring coding information.

Page 6: Introduction - Background Knowledge (4/7)

- In [3], an H.264 encoder using the hyper-threading architecture is reported.
  - A frame is split into several slices, which are processed by multiple threads.
  - Heavy overhead: slicing impairs the data dependencies among MBs.

Page 7: Introduction - Background Knowledge (5/7)

[Figure: block diagram with labels Input File, Image buffer, Slice Queue 0 (I/P), Slice Queue 1 (B), Thread 0 to Thread 4, Output File.]

Page 8: Introduction - Background Knowledge (6/7)

- In [4], a frame is divided into many small partitions with overlapping areas, which are processed concurrently.
  - Not feasible for H.264.
  - Redundant data form the complete search data.

Page 9: Introduction - Background Knowledge (7/7)

- In [5][6], temporal parallelism at the GOP level is used.
  - A large number of frames must be ready before the encoding actually starts.
  - Temporal parallelism is limited to coding standards with a GOP structure.

Page 10: Introduction - Main Purpose (1/2)

- This paper presents a new method for parallel processing of an H.264 video encoder.
  - Data partition
  - Task scheduling
- The new method outperforms prior approaches in both encoding speed and compression efficiency.

Page 11: Introduction - Main Purpose (2/2)

- This paper gives the relations between:
  - the number of parallel processing elements and the theoretical encoding time;
  - the number of processors and the number of concurrently processed frames.
- The results show that this method achieves the same compression efficiency as a sequential encoder.

Page 12: Data Dependencies in H.264 - Overview (1/2)

- Reference software: JM 9.0
  - Sequential processing of MBs
  - Data dependencies
  - Produces the optimal bitstream in terms of coding efficiency (highest compression ratio)

Page 13: Data Dependencies in H.264 - Overview (2/2)

- Objective
  - Explore the elements of the encoder that can be processed in parallel.
  - Maximally exploit the temporal and spatial data dependencies for optimal coding efficiency.

Page 14: Data Dependencies in H.264

- Predicted Motion Vector (PMV)
  - In inter-prediction, the PMV defines the search center of motion estimation.
  - Useful in maintaining continuity of the motion field.
  - It is determined by the MVs of the neighboring subblocks and the corresponding reference indexes.

Page 15: Data Dependencies in H.264

- Intra-frame data dependencies
  - Only the difference (MVD) between the final optimal MV (MV') and the PMV is encoded.

[Figure: the current MB and its neighbors MB A, MB B, MB C, and MB D.]

$$\mathrm{PMV} = f(\mathrm{MV}_A, \mathrm{MV}_B, \mathrm{MV}_C)$$

$$\mathrm{MVD} = \mathrm{MV}' - \mathrm{PMV}$$
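The slide leaves the function f unspecified. As a minimal sketch, assuming f is the component-wise median of the three neighboring motion vectors (the common default case in H.264, ignoring its special cases for unavailable neighbors and particular partition shapes), the PMV and MVD could be computed as follows. The function names and sample vectors are hypothetical.

```python
def median_pmv(mv_a, mv_b, mv_c):
    """Component-wise median of three neighbouring motion vectors (x, y).

    Assumption: f(MV_A, MV_B, MV_C) is a plain median; the slide does not say.
    """
    return tuple(sorted(component)[1] for component in zip(mv_a, mv_b, mv_c))


def motion_vector_difference(mv_final, pmv):
    """MVD = MV' - PMV, the only motion information that is entropy coded."""
    return (mv_final[0] - pmv[0], mv_final[1] - pmv[1])


# Hypothetical neighbour vectors for MB A, MB B, and MB C.
pmv = median_pmv((4, -2), (6, 0), (5, 3))
mvd = motion_vector_difference((7, 1), pmv)
print(pmv, mvd)  # (5, 0) (2, 1)
```

Because the PMV depends on MB A, MB B, and MB C, the current MB cannot start motion estimation until those neighbors have final motion vectors.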

Page 16: Data Dependencies in H.264

- Inter-prediction and mode decision
  - H.264 needs the reconstructed images of encoded frames as references to exploit temporal redundancy.
  - At least the co-located MB and its eight neighboring MBs must be available before the current MB can be encoded.

[Figure: reference frame and current frame.]

Page 17: Data Dependencies in H.264

- Quarter-pel interpolation
  - Before the reconstructed result of the current MB can be used as a reference, it must be interpolated to obtain the values at the ½-pel and ¼-pel positions.
  - The boundary area of the current MB needs 3 rows/columns of pixel values from its neighboring MBs.
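To illustrate why boundary samples reach into neighboring MBs, here is a simplified sketch of H.264 luma interpolation in one dimension: a half-pel sample comes from a six-tap filter (1, -5, 20, 20, -5, 1) over six consecutive integer-pel samples, and a quarter-pel sample is the rounded average of two adjacent samples. The function names are hypothetical, and the two-dimensional cases of the standard are omitted.

```python
def clip255(value):
    """Clamp to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(samples):
    """Half-pel value between the 3rd and 4th of six consecutive integer samples."""
    e, f, g, h, i, j = samples
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)


def quarter_pel(sample_a, sample_b):
    """Quarter-pel value as the rounded average of two adjacent samples."""
    return (sample_a + sample_b + 1) >> 1


row = [0, 0, 0, 255, 255, 255]   # hypothetical pixels across a sharp edge
b = half_pel(row)                # half-pel sample between row[2] and row[3]
a = quarter_pel(row[2], b)       # quarter-pel sample between row[2] and b
print(b, a)  # 128 64
```

The filter needs three integer samples on each side of the half-pel position, so a half-pel sample at an MB edge uses three pixels that belong to the adjacent MB.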

Page 18: Data Dependencies in H.264

- Quarter-pel interpolation

[Figure: sample grid for quarter-pel interpolation, with samples labeled by uppercase letters (A to U), lowercase letters (a to s), and double letters (aa to hh).]

Page 19: Data Dependencies in H.264

- 4×4 and 16×16 intra-prediction and mode decision

Page 20: Data Dependencies in H.264

- Intra-prediction data dependencies

[Figure: MB(i, j) and its neighbors MB(i, j-1) and MB(i-1, j).]

Page 21: Data Dependencies in H.264

- Number of skipped MBs before the current MB
  - In the H.264/AVC standard: mb_skip_run
  - Indicates how many MBs before the current MB in raster-scan order are skipped.
  - Requires knowing the encoding status of the previous MBs.
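As an illustration only, the following sketch derives the skip-run count for each coded MB from a list of per-MB skip flags in raster-scan order. The function name and the flag list are hypothetical; a real encoder writes this value into the bitstream as `mb_skip_run`.

```python
def mb_skip_runs(skipped_flags):
    """Return (mb_index, run) for every coded MB.

    run is the number of consecutive skipped MBs immediately before
    that MB in raster-scan order.
    """
    runs = []
    run = 0
    for index, is_skipped in enumerate(skipped_flags):
        if is_skipped:
            run += 1                  # extend the current run of skipped MBs
        else:
            runs.append((index, run))
            run = 0                   # a coded MB ends the run
    return runs


flags = [False, True, True, False, True, False]
print(mb_skip_runs(flags))  # [(0, 0), (3, 2), (5, 1)]
```

The count for an MB is known only after all preceding MBs have been classified, which is the dependency this slide points out.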

Page 22: Data Partition & Task Priority - Data Partition (1/5)

- MBs in different frames can be processed concurrently only if the necessary reconstructed MBs from the reference frame are all available.
- MBs from different MB rows in the same frame can be processed concurrently only if the neighboring MBs in the row above have all been encoded and reconstructed.
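The two rules can be sketched as a function that lists the MBs that must be finished before a given MB may start. This is one reading of the slides, not the authors' code: it assumes a single reference frame (the previous one), takes the neighbors in the row above to be the top-left, top, and top-right MBs, and treats frame 0 as having no reference. All names are hypothetical.

```python
def required_mbs(frame, row, col, rows, cols):
    """MBs (frame, row, col) that must be reconstructed before this MB starts."""
    # Same frame: left neighbour plus three neighbours in the row above.
    same_frame = [(frame, row, col - 1)]
    same_frame += [(frame, row - 1, col + dc) for dc in (-1, 0, 1)]
    # Reference frame: co-located MB and its eight neighbours.
    reference = []
    if frame > 0:
        reference = [(frame - 1, row + dr, col + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    # Drop positions that fall outside the frame.
    return [(f, r, c) for f, r, c in same_frame + reference
            if 0 <= r < rows and 0 <= c < cols]


# QCIF has 9 MB rows and 11 MB columns.
print(required_mbs(1, 0, 0, rows=9, cols=11))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```

An interior MB in a predicted frame has 13 prerequisites under these assumptions: 4 in its own frame and 9 in the reference frame.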

Page 23: Data Partition & Task Priority - Data Partition (2/5)

- Concurrently processed MBs

[Figure: Wavefront Parallelization across frames (axis: frame number). Legend: MBs which have already been encoded; MBs which are being encoded now; MBs which have not been encoded yet.]

Page 24: Data Partition & Task Priority - Data Partition (3/5)

- Wavefront Parallelization can achieve a constant frame rate for any video format (e.g., QCIF, CIF, HDTV720), provided that:
  - there is a sufficient number of processors;
  - the video sequence is long enough.

Page 25: Data Partition & Task Priority - Data Partition (4/5)

- Example
  - As the frame number increases, the average encoding time per frame approaches 4 T_MB.
  - The number of processor units needed to achieve this is:

$$P_n = \frac{\text{frame\_size\_in\_MB}}{4}$$

[Figure: concurrently processed MBs (axis: frame number).]
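The 4 T_MB figure can be checked with a small simulation. The sketch below assumes that T_MB is the time to encode one MB, that every MB takes exactly one T_MB, that processors are unlimited, and that the dependencies are the left neighbor, the three neighbors in the row above, and the co-located MB with its eight neighbors in the previous frame. This dependency model is a reading of the slides, and all names are hypothetical.

```python
def wavefront_start_times(num_frames, rows, cols):
    """Earliest start time of each MB, in units of T_MB, with unlimited processors."""
    start = {}
    for f in range(num_frames):
        for r in range(rows):
            for c in range(cols):
                deps = [(f, r, c - 1)]
                deps += [(f, r - 1, c + dc) for dc in (-1, 0, 1)]
                deps += [(f - 1, r + dr, c + dc)
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
                # Out-of-frame positions are never in `start`, so they are ignored.
                finish_times = [start[d] + 1 for d in deps if d in start]
                start[(f, r, c)] = max(finish_times, default=0)
    return start


rows, cols = 9, 11                       # QCIF in macroblocks
start = wavefront_start_times(3, rows, cols)
frame_end = [start[(f, rows - 1, cols - 1)] + 1 for f in range(3)]
print(frame_end)  # [27, 31, 35]
```

Under these assumptions each MB starts at 4f + 2r + c, so consecutive frames finish 4 T_MB apart regardless of the frame size.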

Page 26: Data Partition & Task Priority - Data Partition (5/5)

- Each frame is first partitioned into MB rows.
  - An MB cannot be processed until its left neighbor in the same row is encoded.
  - This reduces data exchange between processors.

[Figure: the current frame partitioned into MB rows.]

Page 27: Data Partition & Task Priority - Task assigning and priorities (1/5)

- Task assignment timing diagram

[Figure: task assigning schedule at times t, t+2T, and t+4T, with tasks Frame i, MB row j; Frame i, MB row j + 1; Frame i, MB row j + 2; Frame i + 1, MB row j.]

Page 28: Data Partition & Task Priority - Task assigning and priorities (2/5)

- Example

Task assigning schedule (a new frame starts every 4 T_MB):

- Frame 1, MB row 1
- Frame 1, MB row 2
- Frame 1, MB row 3; Frame 2, MB row 1
- Frame 1, MB row 4; Frame 2, MB row 2
- Frame 1, MB row 5; Frame 2, MB row 3; Frame 3, MB row 1
- Frame 2, MB row 4; Frame 3, MB row 2
- Frame 2, MB row 5; Frame 3, MB row 3; Frame 4, MB row 1
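The schedule in this example can be generated programmatically. The sketch below assumes that a new MB row starts every 2 T_MB and a new frame every 4 T_MB, so the row of frame f with index r (both 1-based) starts in slot 2(f - 1) + (r - 1), where one slot lasts 2 T_MB. The function name is hypothetical.

```python
def row_schedule(num_frames, rows_per_frame):
    """Group (frame, row) tasks by starting slot; one slot lasts 2 T_MB."""
    slots = {}
    for frame in range(1, num_frames + 1):
        for row in range(1, rows_per_frame + 1):
            slot = 2 * (frame - 1) + (row - 1)
            slots.setdefault(slot, []).append((frame, row))
    return [slots[slot] for slot in sorted(slots)]


for tasks in row_schedule(num_frames=4, rows_per_frame=5)[:7]:
    print("; ".join(f"Frame {f}, MB row {r}" for f, r in tasks))
```

With four frames of five MB rows each, the first seven slots list the same row groups as the slide's example.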

Page 29: Data Partition & Task Priority - Task assigning and priorities (3/5)

- To achieve optimal encoding speed:
  - QCIF requires 25 processors
  - CIF requires 99 processors
  - HDTV720 requires 900 processors
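These processor counts follow from dividing the frame size in MBs by 4. A minimal sketch, assuming 16×16 macroblocks, the usual resolutions for each format (176×144, 352×288, 1280×720), and rounding up to a whole processor (the rounding is an assumption; the slides give only the results):

```python
import math


def processors_needed(width, height, mb_size=16):
    """Processors for the 4 T_MB frame period: ceil(frame size in MBs / 4)."""
    mbs = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    return math.ceil(mbs / 4)


for name, (w, h) in {"QCIF": (176, 144), "CIF": (352, 288), "HDTV720": (1280, 720)}.items():
    print(name, processors_needed(w, h))
# QCIF 25
# CIF 99
# HDTV720 900
```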

Page 30: Data Partition & Task Priority - Task assigning and priorities (4/5)

- In practice, a large number of processor units is not available.
- Priority-based task scheduling
  - Priorities are defined at two levels:
    - Inter-frame level
    - Intra-frame level

Page 31: Data Partition & Task Priority - Task assigning and priorities (5/5)

- Inter-frame level
  - If several MBs belonging to different frames are ready to be encoded concurrently, the MBs in the frame with the smaller frame number should be encoded first.
- Intra-frame level
  - If several MBs belonging to different MB rows in the same frame are ready to be encoded concurrently, the MBs in the row with the smaller row index should be encoded first.
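The two priority levels can be sketched as a list scheduler for a limited number of processors. This is a hypothetical illustration, not the authors' simulator: it assumes unit encoding time per MB, one reference frame, and the dependency model of left neighbor, three neighbors above, and a 3×3 reference window. Sorting ready MBs as `(frame, row, col)` tuples applies the inter-frame rule first and the intra-frame rule second.

```python
def dependencies(f, r, c, rows, cols):
    """In-frame prerequisites of MB (f, r, c) under the assumed model."""
    deps = [(f, r, c - 1)] + [(f, r - 1, c + dc) for dc in (-1, 0, 1)]
    if f > 0:
        deps += [(f - 1, r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    return [(df, dr, dc) for df, dr, dc in deps if 0 <= dr < rows and 0 <= dc < cols]


def simulate(num_frames, rows, cols, num_procs):
    """Return the total encoding time in units of T_MB."""
    done_at = {}
    pending = [(f, r, c) for f in range(num_frames)
               for r in range(rows) for c in range(cols)]
    t = 0
    while pending:
        ready = [mb for mb in pending
                 if all(done_at.get(d, float("inf")) <= t
                        for d in dependencies(*mb, rows, cols))]
        ready.sort()                        # smaller frame first, then smaller row
        chosen = set(ready[:num_procs])     # one MB per free processor
        for mb in chosen:
            done_at[mb] = t + 1
        pending = [mb for mb in pending if mb not in chosen]
        t += 1
    return t


print(simulate(3, 9, 11, num_procs=1))     # 297
print(simulate(3, 9, 11, num_procs=1000))  # 35
```

With one processor the schedule degenerates to sequential raster-scan encoding; with more processors than ready MBs it matches the unlimited-processor wavefront.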

Page 32: Experimental Results - Overview (1/1)

- The wavefront simulator is developed in C and run on a PC with a 2.8 GHz P4 processor and 512 MB of memory.
- The simulation results are compared with JM 9.0.
  - H.264 baseline profile
  - Search range = ±10
  - One reference frame, Hadamard transform, full R-D optimization, CAVLC entropy coding

Page 33: Experimental Results

- The relationship between the number of processors and the number of concurrently processed frames

Page 34: Experimental Results

- Theoretical processing time per frame

Page 35: Experimental Results

- Simulation results

Grandma.YUV (QCIF)

| Encoder | Avg encoding time per frame | SnrY | SnrU | SnrV | # of bytes | Speed-up |
| Wavefront simulator | 273 ms | 37.157 | 39.869 | 40.450 | 61464 | 3.17 |
| JM 9.0 | 865 ms | 37.157 | 39.869 | 40.450 | 61464 | 1 |

Paris.YUV (CIF)

| Encoder | Avg encoding time per frame | SnrY | SnrU | SnrV | # of bytes | Speed-up |
| Wavefront simulator | 1272 ms | 35.729 | 39.181 | 39.279 | 128419 | 3.08 |
| JM 9.0 | 3914 ms | 35.729 | 39.181 | 39.279 | 128419 | 1 |
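The speed-up figures are the ratio of the JM 9.0 time to the wavefront simulator time. A short check using the reported averages (the function name is hypothetical):

```python
def speedup(baseline_ms, parallel_ms):
    """Speed-up of the parallel encoder relative to the sequential baseline."""
    return baseline_ms / parallel_ms


print(round(speedup(865, 273), 2))    # 3.17
print(round(speedup(3914, 1272), 2))  # 3.08
```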

Page 36: Conclusions

- This paper presents the new Wavefront Parallelization method for the H.264 encoder.
- Analysis and simulation results show that it can achieve the optimal compression at a frame rate that increases approximately linearly with the number of parallel processing elements.

Page 37: References

[1] T.-C. Chen, Y.-W. Huang, and L.-G. Chen, "Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture," in Proceedings of the 2004 International Symposium on Circuits and Systems, vol. 2, May 2004, pp. II-273-6.

[2] Y.-W. Huang, T.-C. Chen, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, C.-S. Chen, C.-F. Shen, S.-Y. Ma, T.-C. Wang, B.-Y. Hsieh, H.-C. Fang, and L.-G. Chen, "A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications," in IEEE Int. Conf. Solid-State Circuits, Feb 2005, pp. 128-130.

[3] Y.-K. Chen, T. X, S. Ge, and G. M, "Towards efficient multi-level threading of H.264 encoder on Intel hyper-threading architectures," in 18th Int. Parallel and Distributed Processing Symposium, Apr 2004, p. 63.

[4] S. M. Akramulah, I. Ahmad, and M. L. Liou, "Parallelization of MPEG-2 video encoder for parallel and distributed computing systems," in Proceedings of the 38th Midwest Symposium on Circuits and Systems, vol. 2, Aug 1995, pp. 834-837.

[5] P. Tiwari and E. Viscito, "A parallel MPEG-2 video encoder with look-ahead rate control," in Int. Conf. Acoustics, Speech, and Signal Processing, vol. 4, May 1996, pp. 1994-1997.

[6] K. Shen, L. A. Rowe, and E. J. Delp, "Parallel implementation of an MPEG-1 encoder: faster than real time," in SPIE, vol. 2419, Feb 1995, pp. 407-418.