CHAPTER 4 DDS USING HWP CORDIC ALGORITHMshodhganga.inflibnet.ac.in/bitstream/10603/23593/9/09_chapter4.pdf · 90 CHAPTER 4 DDS USING HWP CORDIC ALGORITHM 4.1 INTRODUCTION Conventional

90

CHAPTER 4

DDS USING HWP CORDIC ALGORITHM

4.1 INTRODUCTION

Conventional DDFS implementations have disadvantages in area

and power (Song and Kim 2004b). The conventional implementation of DDS

is a brute-force approach using a lookup table (Nicholas 1988). In this

approach, the lookup table size grows exponentially as more bits are used to

represent the sine and cosine waveforms, although a memory compression

method applied to reduce the size of the lookup table.

In 1959Volder described the Coordinate Rotation Digital Computer

or CORDIC for the calculation of trigonometric functions, multiplication,

division and conversion between binary and mixed radix number systems. A

pure computational algorithm such as CORDIC has been proposed to replace

the large-sized lookup table.

Several algorithms are proposed for calculation of sine and cosine

function. The digit-by-digit methods for the computation of the elementary

functions such as trigonometric, inverse trigonometric, logarithm,

exponential, multiplication, and division functions described by Henry Briggs

in 1624 in “Arithmetica Logarithmica” (Metafas and Goutis 1990).

The programmable CORDIC chip for DSP applications were

proposed by Timmermann et al (1991). These are iterative pseudo division

91

and pseudo multiplication processes, which resemble repeated-addition

multiplication and repeated-subtraction division.

In 2000, Volder had proposed a special purpose digital computing

unit known as CORDIC, while building a real time navigational computer for

use in an aircraft. This algorithm was initially developed for trigonometric

functions which were expressed in terms of basic plane rotations.

The CORDIC algorithm computes 2D rotation using iterative

equations employing shift and add operations. Walther (1971) proposed a

unified algorithm to compute rotation in circular, linear, and hyperbolic

coordinate systems using the CORDIC algorithm. The CORDIC algorithm

developed by using an unified approach was proposed which is used in many

applications (Walther2000).

The DDS architecture based on the differential CORDIC

(DCORDIC) algorithm has presented by Kang et al (2006). The digit-level

pipelining in the CORDIC angle path implemented a two-dimensional

systolic array. More recently, the advances in the VLSI technology and the

advent of EDA tools have extended the application of CORDIC algorithm to

the field of software defined radio, MIMO systems (Wang et al 2006) and

neural networks (Meyer-Base et al 2003) etc.

Optimized DDFS for complex demodulation using CORDIC was

discussed by Jridi (2009). DDFS with 8Hz tuning frequency resolution and 20

bits output data (for sine and cosine waves) implemented in Xilinx FPGA

device giving a maximum operating frequency of more than 306 MHz and a

SFDR of 112dBc.

92

Sung et al (2009) had reported Hybrid CORDIC algorithm for

design and implementation of the DDFS. The multiplier-less architecture with

small ROM and pipelined data path provides a spurious free dynamic range of

84.4dBc. Later on, Eugene (1998) had proposed DDFS based on the Modified

Coordinate Rotation (CORDIC) algorithm.

A second order parabolic approximation for sine function has been

proposed by Sodagar (2001) which is so close to the sine function that

satisfies the accuracy requirements for sine computation in sine-output DDS.

The ROMless Parabolic DDS was designed with a 32-bits phase accumulator

whose output was truncated to 12 bits and the synthesizer had a 10-bit output.

The maximum operating frequency of this method was 175MHz, frequency

resolution was 0.0029Hz and maximum harmonic level was found to be

64dBc.

The CORDIC is an iterative approximation method, which can be

implemented without a lookup table but generates higher spurious harmonic

tones. In order to reduce the spurious tones, a significant number of iterations

and high resolution of the data path are required.

Using a computational algorithm and a lookup table in a combined

approach can offer better optimization in hardware complexity and speed

(Torosyan et al 2002). Further reduction of the lookup table size is desired as

the implementation cost of a digital communication system becomes more

important.

Antelo et al (2008) proposed the unfolded and pipelined CORDIC

by using linear approximation. The linear approximation schemes for rotation

93

(multiplication) and vectoring (division) were complicated the

implementation in a single unit.

This design demands on high resources on FPGA with more power.

Although to minimize the delay, the scale factor compensation technique was

combined with linear approximation. Even though, the pipelined CORDIC

used more number of registers to improve the throughput.

Later, Jun Ma et al (2000) developed the CORDIC-based IIR

digital filters based on fast orthogonal micro rotations for the realization. Each

orthogonal section realizes one real zero or a pair of complex conjugate zeros

of the transfer function. The CORDIC based IIR filter implementation leads

low sensitivity to finite word-length truncation in the filter stop band.

The CORDIC algorithm is performing vector rotations by arbitrary

angles using only shift and add operations. Volder’s algorithm is derived from

the general equations for a vector rotation. If a vector V with Coordinates

(x, y) is rotated through an angle then a vector can be obtained with

coordinates (x , y ) where x and y can be obtained using (x,y) and by the

following method

X= rcos , Y= rsin (4.1)

V = = (4.2)

94

Figure 4.1 Rotation of a vector ‘V’ by an angle

As shown in Figure 4.1, a vector V(x,y) can be resolved in two

parts along the x-axis and y-axis as rcos and rsin respectively. Figure 4.2

illustrates the rotation of a vector V= by the angle .

Figure 4.2 Vector ‘V’ with magnitude r and phase

i.e x= r cos & y= r sin (4.3)

95

Similarly, from Figure 4.1 it can be seen that vector V and V can

be resolved in to two parts. Let V has its magnitude and phase as r and

respectively and V has its magnitude and phase as r and where V came in to

picture after anticlockwise rotation of vector V by an angle . From

Figure 4.1 it can be observed

- = (4.4)

= + (4.5)OX = x =r cos

= r cos( )

= r (cos .cos -sin .sin )

= (rcos )cos -(rsin )sin (4.6)

Using Figure 4.2 and equation 4.3 OX can be represented as

OX =x =xcos – ysin (4.7)

Similarly, OY

OY =y =ycos +xsin (4.8)

Similarly, value for the vector V in the clockwise direction rotating

the vector V by the angle and the equations obtain in this case becomes

x = xcos + ysin (4.9)

y = xsin – ysin (4.10)

The Equations (4.7) to (4.10) can be represented in the matrix form as

= (4.11)

The individual equations for x and y can be rewritten as:

96

x = xcos( ) ± ysin( ) (4.12)

y = ycos( )±xsin( ) (4.13)

Volder observed that by factoring out a cos from both the sides,

resulting equation be in terms of the tangent of the angle , the angle of

which to find sine and cosine. Then this equation can be rewritten as an

iterative formula.

x = cos (x ± ytan( ) (4.14)

y = cos (y±xtan( ) (4.15)

z =z ± , here is the angle of rotation (± sign is showing the

direction of rotation) and z is the argument. For the ease of calculation here

only rotation anticlockwise direction is observed first. Rearranging the

Equation (4.7) and (4.8)

x = cos( ) ((x-ytan( )) (4.16)

y =cos( )(y+xtan( )) (4.17)

The multiplication by the tangent can be avoided in the rotation

angles and therefore tan( ) is restricted so that tan( ) = 2-i. In the digital

hardware this denotes a simple shift operation. Furthermore, if those rotations

are performed iteratively and in both directions every value of tan( ) is

representable. With = arctan(2-i ) the cosine term could also be simplified

and since cos ( ) = cos(- ). It is a constant for a fixed number of iterations.

97

This iterative rotation can be expressed as:

Xi+1 = ki(xi-yi.di.2-i) (4.18)

Yi+1 = ki(yi-xi.di.2-i) (4.19)

Where ‘i’ denotes the number of rotation required to reach the required angle

of the required vector ki= cos (arctan(2-i)) and di = ±1. The product of the ki’s

represent K factor (Walther 1971):

K= (4.20)

is the angle of rotation here for n times rotation).

Table 4.1 12-bit CORDIC hardware

i tan i=2-ii=arctan(2-i) i in radians

0 1 45 0.7854

1 0.5 26.565 0.4636

2 0.25 14.036 0.2450

3 0.125 7.125 0.1244

4 0.0625 3.576 0.0624

5 0.03125 1.7876 0.0312

6 0.015625 0.8938 0.0156

7 0.0078125 0.4469 0.0078

8 0.00390625 0.22345 0.0039

9 0.001953125 0.111725 0.00195

10 0.0009765625 0.0558625 0.000975

11 0.00048828125 0.02793125 0.0004875

98

K is the gain and its value changes as the number of iteration

increases. For 8-bit hardware CORDIC approximation method the value of Ki

as K= cos i = cos 0. cos 1 … … cos 7

= cos 45°cos26.565° … … … … … cos 0.4469°.

=0.6703 (4.21)

From the Table 4.1 it can be seen that precision up to 0.4469 is

possible for 8-bit CORDIC hardware. This l are stored in the ROM hardware

of the CORIC hardware as a look up table.

4.2 IMPLEMENTATION OF CORDIC ARCHITECTURES

In this chapter, different hardware architecture for sine and cosine

computation using CORDIC has been presented. If the sine and cosine

functions have been implemented in digital hardware, it needs more number

of multipliers for many algebraic methods. The alternative techniques such as

polynomial approximation, table-lookup method as well as shift and add

algorithms have reported by Andraka (1998). FPGAs are suitable for

hardware implementation of CORDIC algorithm as with other hardware

circuitry.

The CORDIC algorithm only performs shift and add operations and

it can be easy to implement and also suitable for FPGA based design. It is

necessary to analyze the various hardware architecture of CORDIC for the

present research work. This work has considered four different hardware

architectures of CORDIC. Its speed and area performance has been analyzed.

The most obvious methods of implementing a CORDIC such as bit-serial, bit-

parallel, unrolled and iterative, are described and compared in the following

sections.

99

4.2.1 Iterative CORDIC Architecture

The CORDIC structure as described in Equations (4.10) to (4.14) is

represented by the schematics in Figure 4.3 when directly translated into

hardware. Each branch consists of an adder-subtractor combination, a shift

unit and a register for buffering the output. At the beginning of a calculation

initial values are fed into the register by the multiplexer, where the MSB of

the stored value in the z-branch determines the operation mode for the adder-

subtractor unit. The data in the x and the y branch pass the shift units and are

then added to or subtracted from the unshifted data in the opposite path.

Figure 4.3 Block Diagram of Iterative CORDIC Architecture

The z branch arithmetically combines the register values with the

values taken from a lookup table whose address is changed according to the

number of iterations. For n iterations the output is mapped back to the

registers before initial values are fed. Then the final sine value can be

accessed at the output. The design has been implemented in a FPGA the

initial values for the vector coordinates as well as the constant values in the

100

LUT can be hardwired. The adder and the subtractor components are carried

out separately in the architecture. The multiplexer unit is controlled by the

sign of angle accumulator, which distinguishes between addition and

subtraction operations. The shift operations require a high fan in and reduce

the maximum speed when numbers of iterations are more. In addition the

throughput rate is also limited by the operations that are performed iteratively.

4.2.2 Unrolled CORDIC Architecture

In Unrolled CORDIC architecture, the output of one stage is the

input of the next one as shown in Figure 4.4. First, the shift operations for

each step can be performed by wiring the connections between stages

appropriately. Second, there is no need for changing constant values and that

can be hardwired as well. The purely unrolled design only consists of

combinatorial components and computes one sine and cosine value per clock

cycle. The input values find their path through the architecture on their own

and do not need to be controlled.

Obviously the resources in a FPGA are not very suitable for this

kind of architecture. A bit-parallel unrolled design with 16 bit word length,

each stage contains 48 inputs and outputs with a great number cross-

connections between single stages. These cross connections from the x-path

through the shift components to the y-path and vice versa. It makes the design

difficult to route in a FPGA and cause additional delay times.

From Table 4.2 shows the performance and resource usage change

with the number of iterations if implemented in a XILINX FPGA. The area in

FPGAs can be measured in CLBs. The CLB consists of two lookup tables as

well as storage cells with additional control components (Andhraka 1998).

101

For the purely combinatorial design the CLBs function generators perform the

add and shift operations and storage cells are not used. This means registers

could be inserted easily without significantly increasing the area.

Figure 4.4 Block Diagram of Unrolled CORDIC Architecture

102

However, inserting registers between stages would also reduce the

maximum path delays and also increases the speed. It can be seen that the

number of CLBs are increased with the reduction in the maximum frequency.

The reason for that is the decreasing amount of combinatorial logic between

sequential cells. Obviously, the gain of speed when inserting registers exceeds

the cost of area and therefore the fully pipelined CORDIC is a suitable

solution for generating a sine wave in FPGA’s.

Table 4.2 Performance and CLB usage in Xilinx Spartan 3 FPGA

No. of Iterations 8 9 10 11 12

Complexity(CLB) 184 208 232 256 280

Max path delay(ns) 163.75 177.17 206.9 225.72 256.87

4.2.3 Pipelined CORDIC Architecture

Both the Unrolled and the iterative bit-Parallel designs, show

disadvantages in terms of complexities and path delays along with the large

number of cross connections between single stages. In order to reduce the

complexity, the architecture of bit parallel unrolled CORDIC has been

modified. The modified architecture is known as Bit serial iterative CORDIC

(Wang 2005). Bit-serial means only one bit is processed at the time and hence

the cross connection become one bit-wide data paths.

103

The throughput of pipelined CORDIC is as high as with the

unrolled design as shown in Figure 4.5. The structural simplicity of a bit-

serial design has been achieved high throughput with speed.

Figure 4.5 Block diagram of Pipelined CORDIC Architecture

The pipeline CORDIC design is implemented in a XILINX Spartan

device. The performance is constrained by the use of multiplexers for the shift

operation and even more for the constant LUT. The latter could be replaced

by a RAM or serial ROM where values are read by simply incrementing the

memory’s address.

104

4.2.4 HWP CORDIC Architecture

The development of the HWP CORDIC architecture was carried

out for achieving the highest throughput rate and reduction of hardware-

complexity as well as the computational latency of implementation. Some of

the typical approaches for reducing complexity implementation are targeted

on minimization of using the scaling-operation and complexity of barrel-

shifters and adders in the CORDIC engine. Angle recoding schemes, mixed-

grain rotation and higher radix CORDIC have been developed for reduced

latency realization. The parallel and pipelined CORDIC have been suggested

for high-throughput computation (Lakshmi 2010). The inherent drawbacks of

the conventional pipeline CORDIC algorithm are computational latency and

hardware complexity. Hence, an attempt has been made in the present

research work to concentrate on (i) reduction of computational latency (ii)

reduction of clock routing and hardware complexity of the pipelined CORDIC

architecture based-on Hybrid wave pipelining technique.

105

Figure 4.6 Block digram of HWP CORDIC Architecture

Figure 4.6 shows the architecture of the present HWP CORDIC

processor. This architecture includes X register, Y Register, Angle Register,

Adder/Subtractor and Multiplexers. The Adder/Subtractor module has been

implemented with basic full Adder Circuit. The HWP CORDIC processor can

accept 12 bit input data and produce sine and cosine outputs. The X Register,

Y Register and Angle Registers were separated with D-Flip flip or Register to

compensate the path delays and increases the throughput. A multiplexer can

be used to change position according to the current iteration. The initial

values X0, Y0 and Z0 are fed into the array at the left end of this serial-in

serial-out register. Finally, when all iterations are passed the input

Reg

Reg

Reg

Reg

Reg

Reg

X Register

Angle Register

Y Register

Adder/Subtractor

Adder/Subtractor

Adder/Subtractor

X0

y(n)Y0

Sign(z)

Sign(z)

Z0

x(n)

z(n)

Sign(z)

Constant LUT

106

multiplexers switch again and initial values enter the HWP CORDIC

processor as the computed sine and cosine values exit.

The HWP CORDIC architecture has been implemented on to

Xilinx Spartan FPGA and its critical paths and path delay characteristics were

analyzed. In the conventional pipeline CORDIC architecture Dmax has been

observed at 5.265ns and Dmin was found to be 2.345ns. The maximum and

minimum delay difference was measured as 2.92ns. In HWP CORDIC

architecture the critical paths were adjusted and it was found to be 4.365ns

(Dmax) and 2.784nsec (Dmin) respectively. The path delay difference was

measured at 1.581ns. From the observation it is clear that the Dmax and Dmin

values are less in the present HWP CORDIC architecture.

4.3 COMPARISON OF CORDIC ARCHITECTURES

Various CORDIC architectures used for FPGA implementations

have been discussed in previous section. Table 4.3 illustrates the comparison

between the various CORDIC architectures such as iterative, unrolled,

Pipelined and HWP CORDIC. The iterative design stands out due to its low

area usage and low speed, whereas the maximum throughput rate is much

lower compared to the bit-parallel designs. The bit-parallel unrolled and fully

pipelined design uses the resources extensively but shows the best latency per

sample and maximum throughput rate. The HWP CORDIC design provides a

balance between unrolled and bit-serial design. It shows an optimum usage of

the resources with high speed and maximum throughput rate.

107

Table 4.3 Performance and Area usage for the various CORDIC

architectures

Architecture Type # Slices # LUT #Flip-flops Speed (MHz)Iterative CORDIC 52 63 68 48Unrolled CORDIC 68 84 78 76Pipelined CORDIC 57 72 56 96HWP CORDIC 36 33 44 188

4.4 RESULTS AND DISCUSSION

The present research work has also taken the comparison amongvarious CORDIC architecture. The VHDL code has been developed forPipeline CORDIC and its architecture has been modified into HWP CORDICby altering the critical path delays. From the timing details of the synthesisreport, the maximum(Dmax)and minimum(Dmin) path delays have beenidenified after downloading the source code into Spatran FPGA. The inputdata have been applied and delay values are altered by using Xilinx FPGAEditor.The Figure 4.6a shows the ModelSim simulation result of pipelinedCORDIC.The output samples have been taken and sine and cosine outputwaveforms are plotted in Figure 4.6b.

Figure 4.6a Sine and Cosine output of CORDIC using ModelSim6.2g

108

Figure 4.6b Sine and Cosine waveform generated by CORDIC samples

The conventional pipelinedCORDIC and HWP CORDIC basedDDS architectures have been considered for performance analyis ofCORDIC. From the comparion result, that the pipeline CORDIC and HWPCORDIC produce better performance than other two architectures. Hence,pipelined CORDIC and HWP CORDIC have been taken for DDS systemintegration. This chapter deals with the performance analysis of both pipelineCORDIC DDS and HWP CORDIC DDS. Figures 4.7and 4.11 illustrate thedesign summary of pipeline CORDIC DDS and HWP CORDIC DDSarchitetcure respectively. The pipeline CORDIC DDS uses 47 slices, 41 sliceflipflops, 73 four input LUTs, 48 IOBs and 40 IOB Flipflops. Whereas HWPCORDIC DDS utilizes 45 slice flipflops, 32 four input LUTs, 83 IOBs and 43

-150

-100

-50

0

50

100

150

1

Ang

le in

deg

rees

Time in ns

Sine Plot

Sine Value

-150

-100

-50

0

50

100

150

Ang

le in

deg

rees

Time in ns

Cosine Plot

Cosine

109

IOB flipflops. Figures 4.8 and 4.12 show the RTL schematic of pipelineCORDIC DDS and HWP CORDIC DDS.

Figure 4.7 Design summary of Pipelined CORDIC DDS

Figure 4.8 RTL Schematic of pipelined CORDIC DDS architecture

110

Figure 4.9 Power summary of pipelined CORDIC DDS

Figure 4.10 shows the block diagram of HWP CORDIC DDSarchitecture. This DDS design includes the conventional phaseaccumulatorand HWP CORDIC. The input of the PA is 32bits and its outputis 16bits. The 16bits are quantized into 12bits and it is given to HWPCORDIC. If the 16bit input is used for the HWP CORDIC the number ofiterations are increased in sine and cosine calculation. In order to minimizethe hardware complexity 12bit input is used for HWP CORDIC.

Phase Accumulator

fclk

SineFCWInput

Cosine

Figure 4.10 Block Diagram of HWP CORDIC DDS architecture

111

Figure 4.11 Design summary of HWP CORDIC DDS architecture

The pipeline CORDIC DDS has been consumed 82mW of powerand HWP CORDIC DDS consumed 78mW of power dissipation as shown inFigures 4.9 and 4.13.

Figure 4.12 RTL Schematic of HWP CORDIC DDS

112

Figure 4.13 Power summary of HWP CORDIC DDS

Table 4.4 illustrates the performance comparision of area and

speed of the DDS architectures pipelined CORDIC and HWP CORDIC. The

area usuage of pipeline CORDIC DDS and HWP CORDIC DDS are almost

equal in terms of Flipflops and Four input LUTs. The total estimated power

of HWP CORDIC DDS has been reduced to 78mW and throughput is

increased from 75% to 81% with 176.34MHz frequency improvement.

Table 4.4 Performance comparison of Area and Speed

Parameters Pipelined CORDIC DDS HWP CORDIC DDS

# slice flip-flop 41 45

# Slices 47 23

# 4input LUT 73 32

# Bonded IOB 48 83

Total estimated Power Consumption(mW)

82 78

Throughput (%) 75 81

Maximum frequency(MHz) 164.26 176.34

113

Figure 4.14 shows the ModelSim simulation result of HWP

CORDIC DDS architecture design. It uses 100MHz reference clock and

generates the high frequency of 47MHz output. The same design can

produce minimum frequency to maximum frequencyas shown in Figures 4.15

to 4.21. The common reference clock frequency and input bits have been

consider for this work. Because these values can produce the optimum

outputs such as frequency resolution, speed, area, throughput, power and

SFDR etc. Figure 4.15 has taken 28F5C28h FCW and it can generate the

1MHz output sine and cosine waveform.

Similarly, Figures 4.16, 4.17, 4.18 and 4.19 shows the output

frequencies such as 1.7, 3.5, 4.5 and 7.5MHz. The correcsponding FCWs are

45A1CACh, 8F5C28Fh, 251EB851h and 13333333h respectively. The output

frequencies 25 and 50MHz also has been generated by applying the FCWs of

40000000h and 80000000h respectively as shown in Figures 4.20 and 4.21.

114

Figure 4.14 ModelSim output of HWP CORDIC DDS architecture

Figure 4.15 1MHz HWP CORDIC DDS output

115

Figure 4.16 1.7MHz HWP CORDIC DDS output



116




117

4.5 CONCLUSION

In this chapter, the researcher discusses the basic CORDIC

equation for sine and cosine calculation and analyzed about various CORDIC

architecture. The VHDL code has been developed for HWP CORDIC

architecture and it is simulated using Xilinx ISE9.2i and ModelSim6.2g. The

sine and cosine samples are taken and plotted. The HWP CORDIC has been

developed with DDS top module, the synthesis and simulation results are

verified. The Pipeline PA with pipelined CORDIC based DDS has been

compared with HWP CORDIC based DDS design in terms of speed and area.

From the synthesis and implementation, it is clear that HWP CORDIC has

produced the better performance than pipelined CORDIC.

Documents

CHAPTER 4 DDS USING HWP CORDIC ALGORITHMshodhganga.inflibnet.ac.in/bitstream/10603/23593/9/09_chapter4.pdf · 90 CHAPTER 4 DDS USING HWP CORDIC ALGORITHM 4.1 INTRODUCTION Conventional