Implementation of MAC Assisted CORDIC engine on FPGA
EE382N-4
Abhik Bhattacharya, Mrinal Deo, Raghunandan K R, Samir Dutt
Motivation
• The ARM9-based processor in the TLL 5000's Freescale i.MX21 System-on-Chip has no native floating-point support.
• Floating-point operations are emulated in software libraries, e.g. libc.
• "Math-heavy" applications, e.g. MAC-based operations that require computing sine/cosine/arctan values, are therefore not well suited to this platform.
Hardware Acceleration for Trigonometric Math Operations
Outline
• Select a basic mathematical building block, e.g. CORDIC (from OpenCores).
• Implement the CORDIC engine in hardware (FPGA).
• Implement higher-level primitives, e.g. the Discrete Fourier Transform, using CORDIC.
• Use these blocks in a C program instead of <math.h>.
• Offload the heavy number crunching to the hardware accelerator (FPGA), freeing up valuable CPU resources.
CORDIC engine
• CORDIC (COordinate Rotation DIgital Computer) is a simple and efficient algorithm for calculating hyperbolic and trigonometric functions.
• We use it to calculate the sine and cosine of an angle given in radians or degrees.
• To determine the sine and cosine of an angle β, we find the X and Y coordinates of the corresponding point on the unit circle.
CORDIC contd.
• CORDIC is an iterative algorithm and uses table lookup.
• First step: rotate the vector 45° counterclockwise.
• If ((β – α) != 0), with α the angle reached so far,
    iterate
  else
    exit.
• Successive iterations rotate the vector in one direction or the other, in steps of decreasing size.
• The rotation at step i has tangent 1/2^i, i.e. an angle of arctan(1/2^i).
  – Where "i" is the iteration step.
• Terminate after 16 steps (approximately 5 digits of precision).
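The iteration described above can be sketched in software as follows. This is a minimal floating-point model, not the OpenCores Verilog: the names `cordic_sin_cos`, `ATAN_TABLE` and `K` are hypothetical, and a hardware core would use fixed-point shifts and adds instead of the multiplications by 2^-i.

```python
import math

ITERATIONS = 16  # the slides terminate after 16 steps

# Lookup table of elementary rotation angles arctan(2^-i), in radians.
# Entry 0 is 45 degrees, matching the first step on the slide.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector by sqrt(1 + 2^-2i).
# K is the reciprocal of the accumulated gain (the "CORDIC gain").
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
K = 1.0 / GAIN


def cordic_sin_cos(beta):
    """Return (sin, cos) of beta in radians, for beta in [-pi/2, pi/2]."""
    x, y = K, 0.0  # start on the x-axis, pre-scaled to cancel the gain
    z = beta       # residual angle still to be rotated
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0  # rotate toward the target angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return y, x
```

After 16 steps the residual angle is at most arctan(2^-15), about 3e-5 rad, which is consistent with the "approximately 5 digits" figure on the slide.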
Discrete Fourier Transform(DFT)
DFT can be implemented using CORDIC
Design of CORDIC
• The CORDIC Verilog from OpenCores can be operated in different modes:
  – Pipelined
  – Iterative
  – Combinatorial
• The pipelined mode is the most efficient from a performance perspective. We trade off area for performance (this mode needs the largest number of LUTs).
  – Outputs a result every clock after an initial latency.
• Resolution is limited to 5 bits of precision.
• The algorithm works in the 1st quadrant of the unit circle. Appropriate logic was added to take care of the polarity.
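One way to picture the polarity logic is as a wrapper that folds any angle into the 1st quadrant and then fixes the signs. The sketch below is an assumption about how such logic could look, not the design's actual Verilog; `first_quadrant_core` is a hypothetical stand-in for the CORDIC core and simply calls `math.sin`/`math.cos`.

```python
import math

HALF_PI = math.pi / 2.0


def first_quadrant_core(phi):
    """Stand-in for a core that is only valid for phi in [0, pi/2]."""
    return math.sin(phi), math.cos(phi)


def sin_cos_any_quadrant(theta):
    """Return (sin, cos) of any angle theta (radians) using a 1st-quadrant core."""
    # divmod gives the quadrant index and the remainder phi in [0, pi/2].
    q, phi = divmod(theta, HALF_PI)
    quadrant = int(q) % 4
    s, c = first_quadrant_core(phi)
    if quadrant == 0:   # theta = phi
        return s, c
    if quadrant == 1:   # theta = pi/2 + phi
        return c, -s
    if quadrant == 2:   # theta = pi + phi
        return -s, -c
    return -c, s        # theta = 3*pi/2 + phi
```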
MAC Implementation
• The pipelined CORDIC produces sine/cosine values every cycle if we can maintain a steady inflow of inputs.
• A MAC engine can be built on top of this CORDIC functionality.
• Useful in linear time-variant control systems, where the coefficients may be sine/cosine values that need to be computed and accumulated.
• Simple example: Discrete Fourier Transform.
Design of DFT
• 32-point DFT implemented using the CORDIC-based MAC.
• Samples are sent to the board from the user application.
• One copy of the CORDIC-based MAC is instantiated.
• The design is pipelined to avoid bubbles, providing a new input (angle) to the CORDIC every cycle.
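The MAC formulation of the DFT can be illustrated with a short software model. This is a sketch under assumptions: the input samples are taken to be real-valued, `dft_mac` is a hypothetical name, and `math.sin`/`math.cos` stand in for the values the hardware CORDIC would supply, one angle per inner-loop step.

```python
import math


def dft_mac(samples):
    """N-point DFT of real samples as N multiply-accumulate passes."""
    n_points = len(samples)
    result = []
    for k in range(n_points):
        acc_re, acc_im = 0.0, 0.0  # MAC accumulators for output bin k
        for n in range(n_points):
            # Angle fed to the CORDIC for this (k, n) pair.
            theta = 2.0 * math.pi * ((k * n) % n_points) / n_points
            c, s = math.cos(theta), math.sin(theta)  # stand-in for CORDIC
            acc_re += samples[n] * c  # X[k] = sum x[n] * (cos - j*sin)
            acc_im -= samples[n] * s
        result.append(complex(acc_re, acc_im))
    return result


# Usage: a cosine with 3 cycles over 32 samples concentrates in bins 3 and 29.
tone = [math.cos(2.0 * math.pi * 3 * n / 32) for n in range(32)]
spectrum = dft_mac(tone)
```

Each output bin needs N multiply-accumulates, which is why a steady stream of one angle per cycle into the CORDIC matters.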
Block Diagram of our System
[Figure: block diagram. Labels: Top Level, DFT, CORDIC, MAC Engine, Input Samples, (θ), sin(θ), cos(θ), CORDIC Gain.]
Operation of the System
• The user application writes the 32 data samples to the RAM, followed by a "compute_dft" instruction.
• Data is read from the RAM by the DFT encoder in a pipeline.
• Handshaking between the two pipelined stages:
  – The MAC operation begins after a delay of 16 clocks (the initial latency of the CORDIC pipeline).
  – The 1st MAC output is generated N clocks after the initial latency, where N (= 32) is the length of the input sequence.
  – After the MAC generates N output samples, the result of the N-point DFT is written to the RAM module, followed by an interrupt.
  – The user application reads the results from the RAM through the device driver on detection of this interrupt.
[Figure: timeline. User application writes input to RAM, then the initial CORDIC latency, then MAC operation begins, then the 1st MAC output sample (after N clocks), then the final output from the MAC.]
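A rough cycle estimate follows from the timeline above. The sketch assumes one multiply-accumulate per clock with no bubbles, and it ignores the RAM reads and writes and the interrupt; `dft_cycle_estimate` is a hypothetical helper, not something measured on the board.

```python
CORDIC_LATENCY = 16  # initial latency of the CORDIC pipeline, in clocks


def dft_cycle_estimate(n_points, cordic_latency=CORDIC_LATENCY):
    """Return (clock of 1st MAC output, clock of last MAC output)."""
    first_output = cordic_latency + n_points            # N MACs per output
    last_output = cordic_latency + n_points * n_points  # N outputs in total
    return first_output, last_output


first, last = dft_cycle_estimate(32)
```

Under these assumptions the pipeline latency is paid only once, so the N*N multiply-accumulates dominate the total.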
Performance Measurements
Issues Faced
• Coding an aggressive pipeline (avoiding bubbles) is always a challenge.
• It is a time-consuming process that needs to be done in 2 steps:
  – Code and validate in ModelSim (signals are available for debug).
  – Change the design to run on the FPGA. Iterate for all modules.
• The design needs to be aware of memory timing issues (e.g. back-to-back writes from the FPGA to RAM are a problem).
• Calculating the correct polarity of the CORDIC output samples.
Future scope
• Extend to a 256-point DFT. The design cannot currently be extended further because the resolution of the CORDIC is low; the CORDIC resolution needs to be increased.
Lessons Learnt
• Debug on FPGA is interesting!!
Thank You!!
• No Questions!!! Please!! :x :p