FPGA vs GPU Performance Comparison on the Implementation of FIR Filters

FPGA vs GPU Performance Comparison on the Implementation of FIR Filters

GPGPU platforms

• GP - General Purpose computation using GPU• GPU - Graphics Processing Unit

• CUDA and OpenCL are both frameworks for task-based and data-based general purpose parallel execution. Their architectures show a great similarity. The key difference between these two frameworks is that OpenCL is a cross-platform framework (implemented on CPUs, GPUs, DSPs and etc.); whereas, CUDA is supported only by NVIDIA GPUs

Memory Hierarchy• Memory hierarchy of GPGPU architectures show similarity with CPU memory• At the bottom level of the memory hierarchy the slowest but the largest

capacity memory type resides. This type of memory is named as global memory in CUDA terminology. A typical global memory is 2 or 4 gigabyte size and resides outside of the GPU chip

• Global memories are usually manufactured using DRAM• Constant memory is another memory type in CUDA devices and is optimized

for broadcasting operations, so that it can be accessed faster than global memory.

• Like caches in CPU memory hierarchy, there is a faster but smaller memory type in CUDA memory hierarchy called as shared memory.

• Registers are other storage units in CUDA memory hierarchy which are private for each thread. Registers have the smallest latency and maximum throughput, but their amount is very limited.

Memory Hierarchy

CUDA Memory Hierarchy

Filter Overview FIR • FIR filter structure is constructed from its transfer function and linear difference equation which is obtained from taking inverse Z-transform of the transfer function of the filter

Filter Overview FIR

• The output stream y(n) is calculated by multiplying the input signal [x(n), x(n-1), … x(n-M+1)] with the corresponding filter coefficients [b0, b1, … ,bM-1] and adding all the multiplication results together.

Filter Overview FIR

GPU Implementation

• Three different implementation techniques are designed to compare the performance of GPUs with the FPGAs.

• Two of the designs are implemented using CUDA.• The other design is an OpenCL kernel implementation.• The first CUDA design is a naïve and simple kernel that

does not include any significant optimization. • The other optimized CUDA kernel uses shared memory

and coalesce global memory accesses. • The third one is just an OpenCL version of the highly

optimized CUDA FIR filter implementation.

Basic CUDA FIR Filter

FPGA Implementation

• Three different implementation techniques are selected to synthesize FIR filters on various FPGAs: Direct-form, symmetric-form, and distributed arithmetic.

• It is possible to achieve massive level of parallelism by utilizing multiplier sources of the FPGAs. Most Xilinx FPGAs have DSP48 macro blocks embedded in their chips. These slices have 18x18-bit multiplier units with pre-accumulator, 48-bit accumulators and selection multiplexers in order to speed up DSP operations.

• For the direct-form and symmetric-form FPGA implementations Xilinx’s DSP48 macro slices are utilized.

FPGA Implementation• Distributed arithmetic (DA) technique is an efficient method for

implementing multiplication operations without using the DSP macro blocks of the FPGA. In the DA technique, the coefficients of the FIR filter is represented in two’s complement binary form and all possible sum combinations of the filter coefficients are stored in look-up tables (LUT). Using classical shift-adder method the multiplication operation can be performed effectively without using multiplier units of the FPGA. We used 4-input LUTs of the FPGA to implement the DA form of the FIR filter structure.

• We chose three different FPGAs to compare the performance results of the FIR filters. Utilized FPGAs and their properties are given in Table 1. Xilinx ISE v14.1 software is used to synthesize the circuits.

Results and Discussions

GPU and CPU Performance Results of FIR Filter Application (Million Samples per Second)

Conclusions • FIR filter order has a noticeable effect on performance.• For lower order FIR filters both FPGA and GPU achieved better

performance with respect to higher order FIR filters.• Serialization due to the lack of enough multiplier units is the main

performance decrease reason for FPGAs.• Logic resource capacity of an FPGA is another limiting factor to

implement high order FIR filters• FPGAs have relatively lower prices than GPUs, yet GPUs enjoy the ease

of programmability where FPGAs are still tough to program.• In general, FPGA performance is higher than GPU when the FIR filter is

fully parallelized on FPGA device.• However, GPU outperforms FPGA when the FIR filter has to be

implemented with serial parts on FPGA.

Questions?

• No!

Documents

FPGA vs GPU Performance Comparison on the Implementation of FIR Filters