


Forward Error Correction with Raptor GF(2) and GF(256) Codes on GPU

Linjia Hu, Saeid Nooshabadi, Senior Member, IEEE, and Todor Mladenov, Member, IEEE

Abstract — The Raptor Galois field GF(2) code and its next-generation GF(256) counterpart are members of the rateless fountain code family. Raptor codes have become a preferred technology for forward error correction (FEC) at the application layer in several important consumer applications, such as Internet Protocol TV (IPTV). The Raptor GF(256) code is proposed to reduce the redundant FEC information to a minimum. However, the improved coding performance comes at the expense of increased encoding and decoding complexity. On the other hand, graphics processing units (GPUs) have become commonplace in the consumer market and are finding their way beyond graphics processing into general-purpose computing. This paper investigates the suitability of the GPU for Raptor GF(2) and GF(256) codes processing large block and symbol sizes in FEC transmission. Serial and parallel implementations of the Raptor GF(2) and GF(256) codes are explored on the CPU and GPU platforms, respectively. Our work shows that efficient parallelization on the GPU improves decoder performance significantly. To understand the performance bottlenecks of the Raptor GF(2) and GF(256) codes on both the CPU and GPU platforms, the decoding speed is evaluated for different block and symbol sizes. Furthermore, simulations are performed against the practical real-time requirement of the multimedia broadcast/multicast service (MBMS) and digital video broadcasting (DVB) at the bearer speed of 21 Mbps in a high speed downlink packet access (HSDPA) network.1

Index Terms — GPU, Raptor code, ECC, IPTV.

I. INTRODUCTION

Recently, Raptor codes have gained acceptance in consumer electronics applications [1]-[9]. The Raptor GF(2) code comes as an improvement to Luby transform (LT) codes, providing linear encoding and decoding time [10]. It outperforms already well-known coding schemes, as it can efficiently deliver data of various sizes with minimal negotiation overhead. The Raptor GF(2) code has already been chosen as the forward error correction (FEC) scheme in the 3rd Generation Partnership Project (3GPP) multimedia broadcast/multicast service (MBMS) and digital video broadcasting (DVB) standards [11]-[12].

1 Linjia Hu and Saeid Nooshabadi are with the Center of Computer System Research (CCSR), Michigan Tech, Houghton, MI, 49931, e-mail: ({linjiah, saeid}@mtu.edu)

Todor Mladenov is with Intel Mobile Communications, Am Campeon 10-12, 85521 Neubiberg, Germany, e-mail: ([email protected])

Recent works describe strategies for the design of Raptor GF(2) decoding in broadcast/multicast and IP-Datacast services [2]-[3], [8]. There have also been recent studies on the analysis of algorithmic and computational performance trade-offs of Raptor GF(2) decoding on embedded systems [3], [9]. It is found in [2], [9] that for very large block sizes (≥ 4096 symbols) and symbol sizes (≥ 1024 bytes) in an MBMS system with a bearer speed of 384 kbps, the computational complexity is beyond the capabilities of current embedded systems in a typical consumer electronics product.

Recently, the next generation of Raptor codes, based on the Galois field GF(256) [13],2 has been introduced. It offers better coding performance in terms of reduced symbol overhead for guaranteeing successful decoding, and it supports larger source symbol block sizes. The Raptor GF(256) code for application-layer FEC is specified in RFC 6330 [13]. However, Raptor GF(256) imposes an even higher computational complexity on the decoding system [1] for large block and symbol sizes.

On the other hand, many-core graphics processing units (GPUs) have become commonplace in the consumer market and are finding their way beyond graphics processing into general-purpose data and digital signal processing applications as compute accelerators, including a variety of consumer products [15]-[18]. Therefore, it is natural to employ many-core GPUs in consumer electronics products for non-graphical applications. Many-core GPUs are massively data-parallel computing units. Fortunately, some of the algorithms for decoding Raptor codes are amenable to massive parallelization and are, therefore, suitable for implementation on GPUs [18].

This paper is organized as follows. Section II briefly outlines the structure of the Raptor GF(2) and GF(256) codes. Section III describes the implementation of the Raptor GF(2) and GF(256) decoders on the CPU platform. Section IV explores their parallel implementations on the GPU. Section V evaluates the decoding performance and bottlenecks for the serial CPU and parallel GPU platforms. Section VI evaluates the Raptor GF(2) and GF(256) decoding performance for a bearer speed of 21 Mbps for broadcast systems such as the 3GPP High Speed Downlink Packet Access (HSDPA) specification [19]. Finally, Section VII concludes this paper.

2 The GF(256) Raptor code is specified in the RFC 6330 document [13], where it is called the RaptorQ code.


II. OVERVIEW OF RAPTOR GF(2) AND GF(256) CODES

An overview of the Raptor GF(2) code is presented elsewhere [18]; here, we provide an overview of the Raptor GF(256) code only. Raptor GF(256) forms the next evolution in the design of Raptor codes [20]. As an improvement over Raptor GF(2), the Raptor GF(256) code provides superior reliability, better coding efficiency, and support for larger source block sizes.

The Raptor GF(256) code [13] is a linear block code and, therefore, can be represented by its generator matrices. This property facilitates an explanation of the principle of the Raptor GF(256) code with a block diagram, without delving into the theoretical details. The encoder has the capability to generate as many redundant encoding symbols as needed on the fly; the decoder recovers the source symbols from a slightly larger set of encoded symbols, with a performance very close to that of an ideal binary erasure channel (BEC) code. Here, we provide a brief description of the encoding and decoding operations of the Raptor GF(256) code. A block diagram of the systematic Raptor GF(256) encoder and decoder is shown in Fig. 1.

Fig. 1. Block diagram of systematic Raptor GF(256) encoder and decoder.

Let t be a vector of K source symbols that are to be encoded (1 ≤ K ≤ 56,403). The size T of each source symbol varies from 1 to a maximum of 65,535 (2^16 − 1) bytes. An arbitrary vector t of size K is padded with zeros to a vector t' of size K'. This operation is performed by the "Padding" block in Fig. 1. The size K' can take any value drawn from a subset of 477 source block sizes, distributed in the range from 1 to 56,403, for which the Raptor GF(256) block code is defined. The mapping of K to K' minimizes the amount of table information that needs to be stored and enables faster encoding and decoding. Additionally, the padding zero symbols are not transmitted; they are known a priori at the decoder and act as additionally available parity information. Vector d, at the input of the pre-code block in Fig. 1, consists of L symbols: the K' padded source symbols in vector t' plus S + H zero symbols in vector z.

The pre-code matrix A encapsulates several submatrices. GLDPC1 and GLDPC2 are the generator matrices of two regular low-density parity-check (LDPC) codes, defined over GF(2).

GHDPC, on the other hand, is the matrix of a high density parity check (HDPC) code, defined over GF(256). This HDPC code matrix is the main difference that sets the Raptor GF(256) code apart from the Raptor GF(2) code (defined over GF(2)) and, to a greater extent, any other linear block code in use today. The first K' rows of the Luby transform (LT) [21] matrix GLT(1..K') are included in the pre-code matrix A to render the whole Raptor GF(256) code systematic. IS and IH are identity matrices. The number of rows N (N ≥ K) in the LT encoder matrix GLT(1..N) is set according to the desired rate and the expected channel erasure probability. The Raptor GF(256) encoding process is performed according to (1), where the most time consuming operation is the inversion of matrix A.

$e_{N \times 1} = G_{LT(1..N),\,N \times L}\; A^{-1}_{L \times L}\; d_{L \times 1} = G_{LT(1..N)}\; c_{L \times 1}$   (1)

The decoding process of the Raptor GF(256) code exchanges the positions of the pre-code matrix A and the GLT encoder matrix (now used as the GLT decoder matrix), as illustrated in Fig. 1, with the GLT generator matrices appropriately sized. This allows successful decoding when only the first K encoded symbols have been received and no errors are detected in the channel. The input vector e' is first padded with (K' − K) zeros to produce a symbol vector e'' containing N' (K ≤ N' ≤ N) encoded symbols (which may be nonconsecutive). Vector e'' is then further padded with S + H zeros to form vector d' of size M (M = N' + S + H). Starting with N' = K, the value of N' is iteratively incremented until matrix A becomes invertible. The difference (N' − K) is equal to or greater than the number of systematic encoded symbols lost in the channel. The Raptor GF(256) decoding process is performed according to (2).

$t_{K \times 1} = G_{LT(1..K')}\; (A')^{-1}\; d'_{M \times 1} = G_{LT(1..K')}\; c_{L \times 1}$   (2)
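To make the role of N' concrete, the following host-side sketch shows the control flow implied by the description above; the three helper routines are illustrative assumptions standing in for the real decoder state, not functions defined in the standard or in the paper.

```cuda
// Sketch of growing N' until the decoding matrix A becomes invertible.
// The three helpers are assumed stand-ins for the real decoder state;
// only the control flow described in the text is illustrated here.
bool build_matrix_A(int n_prime);       // assemble A from S + H + N' rows
bool gaussian_eliminate(void);          // returns false if A is singular
bool append_received_symbol(void);      // returns false if none are left

bool invert_with_minimum_overhead(int K)
{
    int n_prime = K;                    // start from the first K symbols
    build_matrix_A(n_prime);
    while (!gaussian_eliminate()) {     // A singular: add one more symbol
        if (!append_received_symbol())
            return false;               // ran out of symbols: decoding fails
        build_matrix_A(++n_prime);
    }
    return true;                        // c = A^{-1} d' is now available
}
```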

The pre-code matrix A in the Raptor GF(256) encoder has to be inverted only once, for a fixed source block size K. On the other hand, the pre-code matrix A in the Raptor GF(256) decoder has to be inverted with every decoded block of data.

III. IMPLEMENTATION ON CPU

The systematic Raptor GF(2) and GF(256) codes are implemented strictly according to the 3GPP and DVB-H standards on both the CPU and GPU, with details as defined, respectively, in IETF RFC 5053 [20] and IETF RFC 6330 [13]. In the entire system, for both Raptor codes, the decoder is the major performance bottleneck due to its computational complexity. The bottleneck is further increased for the large block and symbol sizes that are common in FEC transmission. The data profiling results show that matrix inversion operations consume more than 90% of the decoding time. Hence, we concentrate our efforts on the optimization of the matrix inversion algorithm in the preprocessor.

The specifications of the Raptor GF(2) and GF(256) codes [13]-[14], [20] suggest the use of the inactivation decoding Gaussian elimination (IDGE) algorithm. This technique combines the low complexity of belief propagation (BP) with the decoding guarantee of Gaussian elimination (GE) to reduce the number of computational steps. The vector decoding is performed in parallel with the matrix inversion. However, while IDGE requires fewer steps than GE, its complex features are less amenable to parallelization. On the other hand, the most common GE algorithm for matrix inversion is well suited for implementation on a parallel architecture.

The pseudo code for the GF(256) GE algorithm is shown in Algorithm I. The main steps involved in Algorithm I are row-wise exchange, division, and exclusive-OR (XOR) operations. The structure and simplicity of this algorithm allow for greater design freedom, optimization, and parallelization on a CPU platform and, to a much greater extent, on a parallel GPU platform.
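To make these row operations concrete, the sketch below implements GF(256) multiplication and division with log/antilog tables over the field polynomial 0x11D used by RFC 6330, together with the two row primitives (pivot-row normalization and row elimination). The helper names are ours, not taken from Algorithm I; in GF(2) the elimination step degenerates to the plain row XOR described in the text.

```cuda
// Minimal sketch of GF(256) arithmetic (RFC 6330 field polynomial 0x11D)
// and the two row primitives used by the GE algorithm. Names are ours.
#include <cstdint>

static uint8_t gf_exp[512];   // antilog table, doubled to avoid a modulo
static uint8_t gf_log[256];

static void gf256_init(void) {
    int x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;       // reduce by the field polynomial
    }
    for (int i = 255; i < 512; ++i) gf_exp[i] = gf_exp[i - 255];
}

static inline uint8_t gf256_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

static inline uint8_t gf256_div(uint8_t a, uint8_t b) {   // b != 0
    if (a == 0) return 0;
    return gf_exp[gf_log[a] + 255 - gf_log[b]];
}

// Normalize the pivot row: divide every element by the pivot element.
static void row_scale(uint8_t *row, int n, uint8_t pivot) {
    for (int c = 0; c < n; ++c) row[c] = gf256_div(row[c], pivot);
}

// Eliminate column j from a target row: target ^= factor * pivot_row, where
// factor is the target row's element in column j. Over GF(2) the factor is
// always 1 and this reduces to a plain XOR of the two rows.
static void row_eliminate(uint8_t *target, const uint8_t *pivot_row,
                          int n, int j) {
    uint8_t factor = target[j];
    if (factor == 0) return;              // sparse matrix: skip early
    for (int c = j; c < n; ++c)
        target[c] ^= gf256_mul(factor, pivot_row[c]);
}
```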

The preprocessor matrix A is sparse. Three typical memory mapping schemes for the representation of a sparse preprocessor matrix have been proposed [9]. The first, denoted "UNPACKED", employs a single 32-bit word to store one matrix element; the second, named "PACKED", packs four elements together into a single 32-bit word; and the last, designated "SPARSE", represents the sparse matrix in compressed form through a linked list. The "SPARSE" version is not suited for parallelization; while it saves memory, it does not yield an efficient implementation.
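As an illustration of the trade-off, the sketch below shows a hypothetical PACKED layout that stores four GF(256) elements per 32-bit word; the helper names are ours. For Raptor GF(2), the same idea packs 32 one-bit elements per word, which is why packing pays off far more clearly there (Section V revisits this point).

```cuda
// Hypothetical PACKED layout: four GF(256) elements (one byte each) per
// 32-bit word; element i of a row lives in byte (i % 4) of word (i / 4).
#include <cstdint>

static inline uint8_t packed_get(const uint32_t *row, int i) {
    return (uint8_t)(row[i >> 2] >> (8 * (i & 3)));
}

static inline void packed_set(uint32_t *row, int i, uint8_t v) {
    uint32_t shift = 8u * (i & 3);
    row[i >> 2] = (row[i >> 2] & ~(0xFFu << shift)) | ((uint32_t)v << shift);
}

// A word-level XOR of two packed rows processes four GF(256) elements
// (or 32 GF(2) elements) per operation.
static void packed_row_xor(uint32_t *dst, const uint32_t *src, int words) {
    for (int w = 0; w < words; ++w) dst[w] ^= src[w];
}
```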

IV. PARALLEL IMPLEMENTATION ON GPU

The simplicity of the GE algorithm makes it extremely suitable for implementation on a parallel architecture like the GPU. The Raptor GF(2) code was parallelized [18] using the GPU programming model [22]. To exploit parallelism, a GPU kernel3 is organized as a grid of one- to three-dimensional blocks, with each block structured as a one- to three-dimensional array of threads. This logical programming model enables the development of programs that can run thousands of threads across different generations of GPUs, regardless of the number of physical cores [22]. We designed a parallel implementation of the GE algorithm to take advantage of the massive parallelism offered by the GPU through this programming model.

In the analysis of the implementation of the Raptor GF(2) and GF(256) codes on the GPU platform, we make the key observation that, in addition to the computational complexity, their performance behaviors are determined by several other factors. These factors include global memory issues such as coalescing, shared memory bank conflicts, the overhead of synchronization, and branch divergence. The discussion that follows illustrates our efforts in the implementation and parallel optimization of the GE algorithm on the GPU, taking those factors into account.

The data profiling results of the Raptor GF(2) and GF(256) codes on the CPU show that the most time-consuming portions of the GE algorithm are the forward reduction and backward substitution. Both of these operations are fortunately amenable to massive parallelization on the GPU. In this paper we concentrate on the algorithm for Raptor GF(256); the matrix inversion algorithm of the Raptor GF(2) code is similar to a previously published work [18].

A. Parallel Algorithm Development

The description of our parallel implementation is presented in Algorithm 2. It consists of two loops, for the forward reduction and the backward substitution. These two loops together form a total of six GPU device kernel launches, three CPU function calls, two device-to-host memory transfers, and two host-to-device memory transfers. In addition, there are the initial and final memory transfers between the host and the device and vice versa. The forward reduction loop in Algorithm 2 consists of four device kernels, two host functions, one device-to-host and one host-to-device memory transfer. Before the GPU can perform the decoding, the preprocessor matrix and the data payload are transferred from the host memory to the GPU global memory for parallel processing. The two main computational stages in the forward reduction are row swapping and XOR operations, performed in Kernels 2 and 4, respectively. Two sequences of computing steps, each consisting of a device-to-host memory copy, a GPU kernel, and a host compute function, precede Kernel 2 and Kernel 4 and collectively serve as preprocessing for those kernels.

3 Kernel, in GPU programming terminology [22], refers to a function that runs on the GPU device.


The first sequence of preprocessing steps, before Kernel 2, is used to find the pivot. The second sequence of preprocessing steps, before Kernel 4, discovers the exact rows that need to take part in the forward reduction. While it is possible to perform all the preprocessing steps in GPU kernels, due to the sequential or semi-sequential nature of the operations involved, and the fact that the matrix is sparse, their performance on the GPU suffers significantly from sequential processing and from issues such as branch divergence.
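A condensed host-side sketch of one iteration of the forward-reduction loop, as we read Algorithm 2, is given below. The kernel and helper names are our own (their bodies are sketched in the following paragraphs), error checking is omitted, and the exact launch geometry is illustrative.

```cuda
// One forward-reduction iteration of Algorithm 2 (host-side sketch).
// The four kernels and two host helpers are illustrative declarations.
#include <cstdint>

__global__ void kernel1_gather_column(const uint8_t *A, uint8_t *col, int j, int n);
__global__ void kernel2_swap_rows(uint8_t *A, uint8_t *d, int r1, int r2, int n, int T);
__global__ void kernel3_scale_row(uint8_t *A, uint8_t *d, uint8_t pivot, int j, int n, int T);
__global__ void kernel4_forward_reduce(uint8_t *A, uint8_t *d, const uint8_t *col,
                                       const int *rows, int j, int n, int T);
int find_pivot_on_host(const uint8_t *col, int len);             // offset of first non-zero
int collect_reduction_rows_on_host(const uint8_t *col, int len, int j, int *rows);

void forward_reduction_step(uint8_t *d_A, uint8_t *d_d, uint8_t *d_col, uint8_t *h_col,
                            int *d_rows, int *h_rows, int j, int n, int T)
{
    int len = n - j;
    // Kernel 1 + the single device-to-host copy: column j, rows j..n-1.
    kernel1_gather_column<<<(len + 255) / 256, 256>>>(d_A, d_col, j, n);
    cudaMemcpy(h_col, d_col, len * sizeof(uint8_t), cudaMemcpyDeviceToHost);

    // Host function 1: locate the pivot; Kernel 2 only if a swap is needed.
    int pivot = find_pivot_on_host(h_col, len);
    if (pivot != 0) {
        kernel2_swap_rows<<<(n + T + 255) / 256, 256>>>(d_A, d_d, j, j + pivot, n, T);
        uint8_t tmp = h_col[0]; h_col[0] = h_col[pivot]; h_col[pivot] = tmp;
    }
    // Kernel 3: normalize the pivot row (GF(256) only; skipped when pivot is 1).
    if (h_col[0] > 1)
        kernel3_scale_row<<<(n + T + 255) / 256, 256>>>(d_A, d_d, h_col[0], j, n, T);

    // Host function 2 + the single host-to-device copy: rows to be reduced.
    int count = collect_reduction_rows_on_host(h_col, len, j, h_rows);
    if (count > 0) {
        cudaMemcpy(d_rows, h_rows, count * sizeof(int), cudaMemcpyHostToDevice);
        int start = (j / 32) * 32;                        // 32-aligned segment start
        dim3 grid((n + T - start + 255) / 256, count);    // segments x rows
        kernel4_forward_reduce<<<grid, 256>>>(d_A, d_d, d_col, d_rows, j, n, T);
    }
}
```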

In the sequence of operations in Kernel 1 in Algorithm 2, the part of the current column j with row indices ranging from i to (n − 1) is collected on the device and copied to the host. On the host side, Kernel 2 is conditionally launched only if the first element of the collected data is zero.

Kernel 2 in Algorithm 2, as shown in Fig. 2, swaps the current row i with the target row k. As the data swapping is element-wise, fine-grained parallelism can be exploited to perform this operation, with each thread operating on one element of matrix A or a word-sized chunk of vector d. In this kernel, the content of the current row is physically swapped with that of the target row. The alternative is to swap the pointers to the relevant rows of matrix A and vector d. This technique corresponds to allocating a redirection array in the global memory; all accesses to the elements of matrix A and vector d are redirected through this array. This is a more efficient method, as only two pointers in the redirection array need to be updated at each iteration. While the pointer method is more efficient as far as Kernel 2 is concerned, the overhead of the extra redirection operations in the global memory significantly impairs the performance of Kernel 4. Therefore, in our implementation we simply swap the contents of the two rows in the global memory.
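A possible realization of the row-swap kernel described above follows, continuing the earlier sketch; the kernel name, argument order, and the byte-granular handling of vector d are our assumptions (the paper swaps d in word-sized chunks).

```cuda
// Sketch of Kernel 2: element-wise swap of rows r1 and r2. Each thread
// handles one element of A (indices 0..n-1) or one byte of the attached
// symbol in d (indices n..n+T-1).
__global__ void kernel2_swap_rows(uint8_t *A, uint8_t *d,
                                  int r1, int r2, int n, int T)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {                            // swap one element of matrix A
        uint8_t tmp     = A[r1 * n + idx];
        A[r1 * n + idx] = A[r2 * n + idx];
        A[r2 * n + idx] = tmp;
    } else if (idx < n + T) {                 // swap one byte of vector d
        int c = idx - n;
        uint8_t tmp   = d[r1 * T + c];
        d[r1 * T + c] = d[r2 * T + c];
        d[r2 * T + c] = tmp;
    }
}
```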

The next step in Algorithm 2 is Kernel 3. In this kernel, all the elements in the current row are divided by the pivot element if it is greater than one.
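A matching sketch of the pivot-row division is shown below, assuming the GF(256) log/antilog tables from Section III have been copied into constant memory (for instance with cudaMemcpyToSymbol) during initialization; the pivot value is passed in from the host, which already holds the gathered column.

```cuda
// Sketch of Kernel 3: divide the pivot row (row j) by the pivot element.
// gf_log_c/gf_exp_c mirror the host tables of Section III in constant memory.
__constant__ uint8_t gf_log_c[256];
__constant__ uint8_t gf_exp_c[512];

__device__ inline uint8_t gf256_div_dev(uint8_t a, uint8_t b)   // b != 0
{
    if (a == 0) return 0;
    return gf_exp_c[gf_log_c[a] + 255 - gf_log_c[b]];
}

__global__ void kernel3_scale_row(uint8_t *A, uint8_t *d,
                                  uint8_t pivot, int j, int n, int T)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                               // one element of matrix A
        A[j * n + idx] = gf256_div_dev(A[j * n + idx], pivot);
    else if (idx < n + T)                      // one byte of the symbol in d
        d[j * T + (idx - n)] = gf256_div_dev(d[j * T + (idx - n)], pivot);
}
```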

Next, before the invocation of Kernel 4, the rows required for the forward reduction (rows containing non-zero elements in column j) are counted and their corresponding row indices are recorded on the host. The data profiling shows that A is a sparse matrix, so the number of rows that participate in the forward reduction is very small. This preprocessing discovers the exact rows that need to take part in the forward reduction in Kernel 4 and, thereby, reduces the occurrences of branch divergence that would otherwise hamper the performance of this kernel. As an alternative, the checks for the rows that participate in the forward reduction can be performed in a parallel fashion directly on the GPU. However, due to the overhead of branch divergence and thread synchronization, the performance of a GPU-only code is inferior to a combined GPU/CPU code.

Kernel 4 in Algorithm 2, as depicted in Fig. 3, corresponds to the forward reduction in the GE algorithm, where the elements of the current row i are bit-wise exclusive-ORed with the elements of the selected rows. As noted before, the forward reduction is the most time-consuming part of the GE algorithm.
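For Raptor GF(2) this reduction is exactly the row XOR described above. Over GF(256), the pivot row must additionally be scaled by the target row's element in column j before the XOR, so the sketch below includes that multiplication, reusing the constant-memory tables introduced with Kernel 3. Reading the scale factor from the unmodified column copy produced by Kernel 1 avoids a race with the thread that clears column j; this detail, like the kernel name and layout, is our assumption.

```cuda
// Sketch of Kernel 4: forward reduction. blockIdx.y selects one of the rows
// recorded by the host; blockIdx.x/threadIdx.x select a 32-aligned segment of
// that row (matrix A elements first, then the T bytes of the attached symbol).
__device__ inline uint8_t gf256_mul_dev(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp_c[gf_log_c[a] + gf_log_c[b]];
}

__global__ void kernel4_forward_reduce(uint8_t *A, uint8_t *d, const uint8_t *col,
                                       const int *rows, int j, int n, int T)
{
    int r = rows[blockIdx.y];                  // target row (below the pivot)
    uint8_t factor = col[r - j];               // its element in column j (non-zero)

    int start = (j / 32) * 32;                 // 32-aligned segment start
    int idx = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        uint8_t p = A[j * n + idx];            // pivot-row element
        if (p) A[r * n + idx] ^= gf256_mul_dev(factor, p);   // skip zero operands
    } else if (idx < n + T) {
        int c = idx - n;
        uint8_t p = d[j * T + c];
        if (p) d[r * T + c] ^= gf256_mul_dev(factor, p);
    }
}
```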


B. Performance Considerations

To improve the performance of the forward reduction, three major optimization techniques are considered. The first involves a major reduction in the number of global memory accesses. We note that the values of the first j elements of each reduced row are updated to zero after each reduction operation. This means only the remaining (n − j) elements in each row need to participate in the forward reduction in the next iteration. To avoid accessing row elements that have already turned zero in previous iterations, the number of row elements involved in the forward reduction changes dynamically in each iteration. To further reduce memory accesses, the implementation skips the memory update after an exclusive-OR operation if the current operand is detected as zero. This works well because the matrix is sparse.

Fig. 2. Kernel 2 of Parallel GE.

As a second global memory optimization technique, coalescing is employed to combine the thread memory accesses into a single transaction. It is achieved by ensuring that, in each row operation, the read/write memory addresses issued by the threads in a block are consecutive and aligned. As seen from Fig. 2 and Fig. 3, as column j moves from left to right, the number of row elements accessed in the row reduction operation decreases. To enforce memory alignment, the reduction in row element accesses is made in steps of 32. Although, on average, this requires 16 redundant memory accesses, the performance gain obtained through memory alignment far outweighs the excess parallel reads.

The last optimization involves the block and grid configuration, which influences the performance substantially. The data profiling shows that the best arrangement for the forward reduction on the GPU platform is one-dimensional blocks of 256 threads, arranged in a two-dimensional grid. In this layout, each block reduces only one segment of a row of matrix A or of vector d. This arrangement also ensures memory coalescing and alignment. Note that in this implementation each element is used only once, so there is no scope for data reuse. However, the advantage is that there is no need for thread synchronization, and the threads can execute faster.
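Putting the three optimizations together, the per-iteration launch geometry might look as follows; this is a sketch in which the 256-thread block size comes from the text and the variable names are ours.

```cuda
// Launch geometry for one forward-reduction iteration (sketch). Only the
// columns from the 32-aligned position at or below j onward are touched,
// so the grid shrinks as the elimination advances from left to right.
int  start = (j / 32) * 32;                 // aligned start column
int  width = (n + T) - start;               // A elements + symbol bytes in play
dim3 block(256);                            // one-dimensional 256-thread blocks
dim3 grid((width + block.x - 1) / block.x,  // x: segments of one row
          count);                           // y: one block row per reduced row
kernel4_forward_reduce<<<grid, block>>>(d_A, d_d, d_col, d_rows, j, n, T);
```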

Fig. 3. Kernel 4 of Parallel GE.

Fig. 4. Kernel 6 of Parallel GE.

The backward substitution loop in Algorithm 2 consists of two device kernels, one host function, one device-to-host and one host-to-device memory transfer. Kernel 5, combined with the host function and memory transfers, is similar to the preprocessing in the forward reduction loop. Similarly, Kernel 6, as depicted in Fig. 4, is the back substitution equivalent of Kernel 4. After the forward reduction, matrix A is transformed into an upper triangular matrix. The operations in the backward substitution are primarily for updating the content of vector d. In addition, in each iteration only one column of matrix A participates in the backward substitution. Therefore, we only need to search for the elements of the current column of matrix A that are non-zero and record them. This record identifies the rows of vector d that need to participate in the backward substitution. During this preprocessing, the indices of the relevant rows of d are identified and copied to the device. Kernel 6 performs the actual backward substitution.
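A sketch of the backward-substitution kernel along these lines: for column j, each recorded row r (with a non-zero element A[r][j] above the diagonal) updates its symbol as d[r] ^= A[r][j] * d[j] over GF(256); in GF(2) the factor is 1 and this is again a plain XOR. The kernel name and layout are ours, and the GF(256) multiply reuses the device helper shown with Kernel 4.

```cuda
// Sketch of Kernel 6: backward substitution for column j. blockIdx.y picks
// one recorded row; the threads cover the T bytes of its symbol in d.
// Matrix A is only read here, since the result of interest is vector d.
__global__ void kernel6_back_substitute(const uint8_t *A, uint8_t *d,
                                        const int *rows, int j, int n, int T)
{
    int r = rows[blockIdx.y];                 // row above the diagonal, A[r][j] != 0
    uint8_t factor = A[r * n + j];
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < T) {
        uint8_t p = d[j * T + c];             // already-solved symbol j
        if (p) d[r * T + c] ^= gf256_mul_dev(factor, p);
    }
}

// Typical launch: one 256-thread block per segment of T and per recorded row.
// kernel6_back_substitute<<<dim3((T + 255) / 256, count), 256>>>(d_A, d_d, d_rows, j, n, T);
```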


V. DECODING PERFORMANCE

This section presents the experimental results for the serial and parallel Raptor GF(2) and GF(256) codes on the CPU and GPU platforms.

Both the serial and parallel implementations of the Raptor GF(2) code are based on IETF RFC 5053. As recommended by the 3GPP and DVB-H standards, the maximum block size is 8192 symbols and the maximum symbol size is 32468 bytes. In the experiments, we study a sample of cases with block sizes K = 128, 256, 512, 1024, 2048, 4096, and 8192 symbols and symbol sizes T = 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768 bytes.

As for the Raptor GF(256) code, its serial and parallel implementations are based on IETF RFC 6330. The recommended maximum block size is 56,403 symbols (roughly seven times larger than for Raptor GF(2)), and the maximum symbol size is 32468 bytes. In our experiments, we study sample cases with K = 128, 256, 512, 1024, 2048, 4096, and 8192 symbols and T = 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768 bytes.

Fig. 5. Raptor GF(2) Speedup of CPU-PACKED-GE/GPU-PACKED-GE.

We developed three decoding algorithms on the CPU for both the Raptor GF(2) and GF(256) codes. The first is CPU-PACKED-GE, in which the preprocessing matrix is compressed and the matrix inversion algorithm is GE; the second is CPU-UNPACKED-GE, in which the preprocessing matrix is not compressed and the matrix inversion algorithm is GE; the last is CPU-SPARSE-IDGE, in which the preprocessing matrix is represented by a linked list and the matrix inversion algorithm is IDGE. On the GPU platform, only the GPU-PACKED-GE and GPU-UNPACKED-GE algorithms were implemented; SPARSE-IDGE, due to its non-parallelizable nature, was not implemented on the GPU.

A. Raptor GF(2) Code

For the Raptor GF(2) code, CPU-PACKED-GE and GPU-PACKED-GE achieve the fastest decoding speeds on the CPU and GPU platforms, respectively, for all combinations of K and T [18]. Fig. 5 depicts the speedup of GPU-PACKED-GE over CPU-PACKED-GE for a range of K and T values. As seen, for a given value of K the speedup increases as (log2(T))^2.5. For a given value of T the speedup also increases with the matrix size K, at the same rate of (log2(K))^2.5, when K is small, and saturates for large values of K. The maximum speedup reaches 305 for K = 4096 and T = 32768.

Fig. 6. Raptor GF(256) Speedup of CPU-PACKED-GE/GPU-PACKED-GE.

B. Raptor GF(256) Code

As for the Raptor GF(256) code, the gap between the decoding speed of CPU-UNPACKED-GE and that of CPU-PACKED-GE is small for all combinations of K and T, with CPU-PACKED-GE performing slightly better in most test cases. This is clearly different from the results obtained for the Raptor GF(2) code [18]. That is because, in the Raptor GF(256) code, matrix elements range from 0 to 255 and require one byte of storage, so only four elements can be packed into one word; considering the overhead involved in extracting the symbols from the packed word, the word-level parallelism does not pay off. Similar results are observed for the GPU-PACKED-GE and GPU-UNPACKED-GE algorithms on the GPU.

However, on the CPU platform the CPU-SPARSE-IDGE algorithm performs best and, therefore, we limit the discussion to the comparison of the CPU-PACKED-GE and CPU-SPARSE-IDGE algorithms on the CPU with the GPU-PACKED-GE algorithm on the GPU.

Fig. 6 depicts the speedup of GPU-PACKED-GE over CPU-PACKED-GE. The speedup ratios are higher in comparison with the Raptor GF(2) code, even though the plots exhibit similar behavior with respect to variations in T and K. The maximum speedup reaches 354 for K = 4096 and T = 32768.


Fig. 7. Raptor GF(256) Speedup of CPU-SPARSE-IDGE/GPU-UNPACKED-GE.

Fig. 7, on the other hand, presents the speedup of GPU-PACKED-GE over CPU-SPARSE-IDGE. The speedup behavior, however, is unlike the other two cases. For block sizes K less than 2048, the speedup remains constant with respect to T. That is due to the power of the IDGE algorithm, which removes the matrix inversion as a performance bottleneck for small to moderately sized matrices; the speedup is then purely due to the parallel operations on the symbol vector and is, therefore, independent of T. For large values of K the time consumed in the matrix inversion becomes significant, and the parallel operations on the GPU win dramatically. The maximum speedup observed approaches 162 for K = 8192 and T = 32768.

VI. RAPTOR GF(256) CODE PERFORMANCE FOR THE BROADCAST SYSTEM

Next, we analyze the suitability of the best performing CPU (CPU-SPARSE-IDGE) and GPU (GPU-PACKED-GE) algorithms for use in broadcast applications with a typical bearer speed of 21 Mbps and an erasure rate of 20%. Tables I and II (Raptor GF(2)) and Tables III and IV (Raptor GF(256)) present the slack times for several block and symbol sizes required to maintain the real-time requirement of the 21 Mbps bearer speed. Positive values indicate the slack time left before the decoding of the next received block of data is due to start. Negative values, on the other hand, indicate the decoding time over budget.
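As a concrete reading of the tables, assuming the slack is reported in seconds as the difference between a block's transmission time at the bearer speed and its measured decoding time (and ignoring the 20% erasure overhead for simplicity): a source block with K = 1024 symbols of T = 1024 bytes carries 1024 × 1024 × 8 ≈ 8.4 Mb, which takes roughly 8.4 Mb / 21 Mbps ≈ 0.4 s to arrive. The decoder therefore has about 0.4 s to finish before the next block is due; a negative table entry means the decoder needed longer than that budget.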

Considering the data in Table I and Table II, it is clear that, except for block sizes below K = 512, the decoding fails to meet the real-time requirement of the 21 Mbps bearer speed on the CPU platform. On the other hand, the decoding on the GPU platform fails for only four cases.

For the case of Raptor GF(256), in Table III and Table IV, the decoding on the CPU platform fails for more than half of the cases, while there are only a handful of failing cases on the GPU platform.

To sum up, for both the Raptor GF(2) and GF(256) codes, the serial decoding algorithms on the CPU cannot satisfy the real-time requirement of the HSDPA broadcast system for most of the test cases, whereas the GPU-PACKED-GE decoding algorithm on the GPU satisfies the real-time requirement except for a few cases.

TABLE I
DECODING SLACK TIME OF RAPTOR GF(2) CODE WITH BEARER SPEED OF 21 MBPS: RAPTOR-GF(2)-CPU-PACKED-GE

K\T       128      256      512     1024     2048     4096     8192    16384    32768
128      0.01     0.01     0.02     0.04     0.08     0.16     0.28     0.64     1.10
256      0.00     0.01     0.03     0.04     0.08     0.16     0.25     0.37    -0.20
512     -0.01    -0.01    -0.01    -0.04    -0.07    -0.17    -0.91    -3.66   -11.25
1024    -0.05    -0.10    -0.15    -0.35    -0.83    -2.27    -7.86   -19.97   -47.09
2048    -0.37    -0.54    -1.09    -2.23    -6.90   -18.57   -42.19   -88.8   -187.19
4096    -1.57    -2.57    -5.62   -13.54   -31.02   -65.67  -140.39  -294.94  -550.68
8192   -11.54   -19.05   -30.88   -57.24  -110.17  -226.07  -444.67  -852.87 -1800.97

TABLE II
DECODING SLACK TIME OF RAPTOR GF(2) CODE WITH BEARER SPEED OF 21 MBPS: RAPTOR-GF(2)-GPU-PACKED-GE

K\T       128      256      512     1024     2048     4096     8192    16384    32768
128      0.00     0.01     0.03     0.06     0.13     0.27     0.54     1.08     2.17
256      0.00     0.02     0.05     0.12     0.25     0.50     1.02     2.06     4.12
512      0.00     0.03     0.10     0.22     0.48     0.98     1.99     4.00     8.02
1024     0.00     0.06     0.19     0.44     0.93     1.93     3.90     7.87    15.79
2048    -0.03     0.10     0.35     0.85     1.82     3.78     7.70    15.53    31.19
4096    -0.14     0.11     0.62     1.59     3.52     7.41    15.17    30.66    61.60
8192    -0.74    -0.26     0.75     2.55     6.42    14.07    29.35    59.57   119.79

TABLE III
DECODING SLACK TIME OF RAPTOR GF(256) CODE WITH BEARER SPEED OF 21 MBPS: RAPTOR-GF(256)-CPU-PACKED-IDGE

K\T        128      256      512     1024     2048     4096     8192    16384    32768
128       0.00     0.01     0.03     0.06     0.13     0.26     0.53     1.07     2.16
256      -0.01    -0.00     0.03     0.10     0.23     0.48     0.99     1.99     4.01
512      -0.09    -0.10    -0.03     0.10     0.35     0.83     1.78     3.51     7.29
1024     -0.42    -0.49    -0.46    -0.19     0.32     1.32     3.03     6.04    13.09
2048     -2.55    -2.72    -2.78    -2.05    -1.38     1.09     4.01     9.99    23.06
4096    -12.53   -12.02   -11.49   -10.33    -8.73    -3.45    -5.84   -49.94  -149.13
8192    -59.72   -58.99   -58.65   -64.28   -80.11  -155.95  -815.94 -1184.19  -2920.6
16384   -291.9   -280.3   -364.3  -449.62  -567.93  -874.23  -3023.9  -5160.6  -6733.7

TABLE IV
DECODING SLACK TIME OF RAPTOR GF(256) CODE WITH BEARER SPEED OF 21 MBPS: RAPTOR-GF(256)-GPU-PACKED-GE

K\T        128      256      512     1024     2048     4096     8192    16384    32768
128       0.00     0.01     0.03     0.06     0.13     0.26     0.53     1.08     2.16
256       0.00     0.02     0.05     0.11     0.24     0.50     1.01     2.04     4.09
512       0.00     0.03     0.09     0.21     0.46     0.96     1.95     3.92     7.88
1024      0.02     0.05     0.17     0.41     0.89     1.86     3.80     7.67    15.41
2048     -0.08     0.04     0.29     0.76     1.71     3.60     7.38    14.97    30.11
4096     -0.47    -0.30     0.21     1.14     2.97     6.66    14.10    28.83    58.22
8192     -2.92    -2.52    -1.49     1.11     3.70    10.67    24.81    53.28   109.53
16384   -15.76   -14.87   -12.83    -9.93     3.81    11.78    35.24    89.81   202.87

VII. CONCLUSION

In this paper, we developed parallel GE algorithms for the Raptor GF(2) and GF(256) codes on a GPU platform and measured their performance against the best serial algorithms on a CPU. The experimental results show that the decoding speeds of the parallel GPU Raptor GF(2) and GF(256) codes are far superior to those of their serial counterparts on the CPU for large block and symbol sizes.


Further, the parallel implementations of the Raptor GF(2) and GF(256) codes on the GPU satisfy the real-time requirement of HSDPA broadcast for MBMS and DVB with a bearer speed of 21 Mbps.

REFERENCES

[1] T. Mladenov, S. Nooshabadi, and K. Kim, “Efficient GF(256) Raptor code decoding for multimedia broadcast/multicast services and consumer terminals,” IEEE Trans. Consum. Electron., vol. 58, no. 2, pp. 356-363, May 2012.

[2] T. Mladenov, S. Nooshabadi, and K. Kim, “Strategies for the design of Raptor decoding in broadcast/multicast delivery systems,” IEEE Trans. Consum. Electron., vol. 56, no. 2, pp. 423-428, May 2010.

[3] T. Mladenov, S. Nooshabadi, and K. Kim, “MBMS Raptor codes design trade-offs for IPTV,” IEEE Trans. Consum. Electron., vol. 56, no. 3, pp. 1264-1269, Aug. 2010.

[4] E.-S. Ryu and N. Jayant, “Home gateway for three-screen TV using H.264 SVC and raptor FEC,” IEEE Trans. Consum. Electron., vol. 57, no. 4, pp. 1652–1660, Nov. 2011

[5] K. Noh, S. Yoon, and J. Heo, “Performance analysis of network coding with raptor codes for IPTV,” IEEE Trans. Consum. Electron., vol. 55, no. 1, pp. 83–87, Feb. 2009.

[6] J. Heo, S. Kim, J. Kim, and J. Kim, “Low complexity decoding for Raptor codes for hybrid-ARQ systems,” IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 83–87, May 2008

[7] S. W. Kim, S. Y. Kim, S. Kim, and J. Heo, “Performance analysis of forward error correcting codes in IPTV,” IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 376–380, May 2008.

[8] T. Mladenov, S. Nooshabadi, and K. Kim, “Efficient incremental Raptor decoding over BEC for 3GPP MBMS and DVB IP-Datacast services,” IEEE Trans. Broadcast., vol. 57, no. 2, pp. 313–318, Jun. 2011.

[9] T. Mladenov, S. Nooshabadi, and K. Kim, “Implementation and evaluation of Raptor codes on embedded systems,” IEEE Trans. on Comput., vol. 60, no. 12, pp. 1678–1691, Dec. 2011.

[10] A. Shokrollahi, “Raptor codes,” IEEE Trans. Inf. Theory, vol. 52, pp. 2551–2567, Jun. 2006.

[11] 3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; Multimedia Broadcast/Multicast Service (MBMS); Protocols and codecs 3GPP TS 26.346 V7.4.0 Release 7, Jun. 2007.

[12] Digital Video Broadcasting (DVB); IP Datacast over DVB-H: Content Delivery Protocols, ETSI TS 102 472 V1.3.1, Jun. 2009.

[13] RaptorQ Forward Error Correction Scheme for Object Delivery, IETF Proposed Standard, RFC 6330, Aug. 2011.

[14] Raptor Forward Error Correction (FEC) Schemes for FECFRAME, IETF Proposed Standard, RFC6681, Aug. 2012.

[15] L. Do, G. Bravo, S. Zinger, and P. H. N. de With, “GPU-accelerated real-time free-viewpoint DIBR for 3DTV,” IEEE Trans. Consum. Electron., vol. 58, no. 2, pp. 633–640, May 2012.

[16] S. H. Lee and S. Sharma, “Real-time disparity estimation algorithm for stereo camera systems,” IEEE Trans. Consum. Electron., vol. 57, no. 3, pp. 1018–1026, May 2011.

[17] H.-C. Shin, Y.-J. Kim, H. Park, and J.-I. Park, “Fast view synthesis using GPU for 3D display,” IEEE Trans. Consum. Electron., vol. 54, no. 4, pp. 2068–2076, Nov. 2008.

[18] L. Hu, T. Mladenov, and S. Nooshabadi, “Implementation and evaluation of Raptor code on GPU,” in Proc. of IEEE Int. Sym. Consum. Electron. (ISCE), Harrisburg, PA, June 2012.

[19] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; High Speed Downlink Packet Access (HSDPA); Overall description; Stage 2, (Release 7), 3GPP TS 25.308 V7.10.0, Jun. 2009.

[20] Raptor Forward Error Correction Scheme for Object Delivery, IETF Proposed Standard, RFC5053, Sep. 2007.

[21] M. Luby, “LT codes,” in 43rd Annu. IEEE Symp. on Found. of Comput. Sci., Vancouver, BC, Canada, Nov. 2002, pp. 271–280.

[22] NVIDIA CUDA C Programming Guide, ver. 4.2, NVIDIA Corp., Santa Clara, CA, 2012.

BIOGRAPHIES

Linjia Hu received the B.S. in Electronic Engineering from Central South University, Changsha, China, in 2006, and the M.E. in Software Engineering from the University of Science and Technology of China, Hefei, China, in 2009. He was with ZTE Corporation, Shenzhen R&D center, China, from 2007 to 2009. Currently, he is a Ph.D. candidate in the Department of Computer Science, Michigan Technological University. His research interests include GPU and multicore computing, real-time embedded system design, point cloud and 3D mesh processing, digital image processing, and clock synchronization in distributed systems.

Saeid Nooshabadi (M’01–SM’07) received the M.Tech. and Ph.D. degrees in electrical engineering from the Indian Institute of Technology, Delhi, India, in 1986 and 1992, respectively. Currently, he is a professor of Computer Systems Engineering, with a joint appointment in the Departments of Electrical & Computer Engineering and Computer Science, Michigan Technological University, Michigan. Prior to his current appointment he held multiple academic and research positions; his last two appointments were with the Gwangju Institute of Science and Technology, Republic of Korea (2007 to 2010), and with the University of New South Wales, Sydney, Australia (2000 to 2007). His research interests include VLSI information processing and low-power embedded processors for wireless network and biomedical applications.

Todor Mladenov (M’06) received the B.Eng. degree in communications and communications technologies from the Technical University of Sofia, Bulgaria, in 2005, and the M.S. and Ph.D. degrees in information and communications from the Gwangju Institute of Science and Technology (GIST), Republic of Korea, in 2007 and 2011, respectively. From 2011 to 2012 he worked as a researcher and lecturer at GIST, with the School of Information and Communications and with the Department of Nanobio Materials and Electronics. In 2012 he joined the processor subsystem group at Intel Mobile Communications. His research interests include microprocessor architectures, network-on-chip (NoC), and the design of low-power, high-speed application-specific circuits and systems for multimedia, communications, and information theory.