Faculty of Computing and Information Technology
Department of Robotics and Digital Technology

Technical Report 94-9

A VHDL Implementation of a CORDIC Arithmetic Processor Chip

Grant Hampson, Student Member, IEEE
Andrew Papliński, Member, IEEE

October 10, 1994

Enquiries:
Technical Report Coordinator
Robotics and Digital Technology
Monash University
Clayton VIC [email protected] +61 3 905 3402
Contents

Abstract and Keywords
Preface
1 The CORDIC Algorithm
2 CORDIC Hardware Implementations
  2.1 CORDIC Processor Architecture
    2.1.1 A Word-Serial CORDIC Architecture
    2.1.2 A Word-Parallel CORDIC Architecture
3 Improving CORDIC Accuracy
  3.1 Estimation of CORDIC Accuracy
  3.2 The Lower Bound of CORDIC Accuracy
  3.3 Reducing the z update error
  3.4 Unexpected Truncation Errors
4 VHDL Implementation
  4.1 The Basic CORDIC Unit
  4.2 VHDL Describes Structure and Behaviour
    4.2.1 Hierarchical vs Flat Designs
    4.2.2 The Viewlogic Synthesiser
  4.3 VHDL Design of the CORDIC Unit
    4.3.1 The Rounding Unit
  4.4 Combining the CORDIC Units
    4.4.1 A Solution
  4.5 Improvements
Conclusion
A CORDIC Functions
B Upper Bound of CORDIC Error
References

List of Tables

1.1 Elementary angles of $\alpha_i$
1.2 Various values of $K_n$
4.1 Some CORDIC hardware statistics.
A.1 The six CORDIC modes.

List of Figures

1.1 Rotation of a point in 2-D space.
2.1 Generic Processor Architecture.
2.2 An Optimised Word-Serial CORDIC Architecture.
2.3 Word-Parallel CORDIC architecture with possible data pipelining.
3.1 Numerical accuracy of the CORDIC processor.
3.2 Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath.
3.3 A plot showing bits of error for a typical test vector rotated through all possible angles.
3.4 A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results.
3.5 An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results.
3.6 Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme.
3.7 An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding.
4.1 The basic CORDIC unit.
4.2 A Hierarchical Design of the Adder/Subtracter for n = 4.
4.3 A Flat Design of the Adder/Subtracter for n = 4.
4.4 A Behavioural Design of the Adder/Subtracter for n = 4.
4.5 The structure of CORDIC unit showing the various entities.
4.6 The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.
Abstract

This report describes the fundamentals of the CORDIC (Co-ordinate Rotations Digital Computer) algorithm and a possible implementation using the VHDL hardware description language. The errors associated with a fixed point implementation of CORDIC are analysed, and methods for reducing these errors are discussed. One such method is a normalisation scheme which reduces error and requires no extra hardware. Various CORDIC structures and possible VHDL implementations are described in detail, including design and language issues. Finally, a parallel hardware implementation is described and simulated.

CORDIC has many applications, some of which can be used in array imaging techniques.

Keywords

CORDIC, VHDL
Preface

CORDIC is an acronym for Coordinate Rotations Digital Computer and was derived by Volder [1] in the late 1950's for the purpose of calculating trigonometric functions. Its popularity came about nearly twenty years later, when VLSI solutions became a reality.

The original algorithm describes the rotation of a 2-D vector, which can be applied in applications such as Digital Signal Processing [2] (Fourier Transforms, Digital Filters), Computer Graphics [3] and Robotics [4].

CORDIC processing offers high computational rates, making it attractive to applications such as computer graphics where a combination of scaling and rotations is required in real time. CORDIC is also attractive to Robotics, where the fundamental operation is coordinate transformation; it could also be used for more computationally intensive processes such as motion planning and collision detection.

Array Imaging typically involves complex signal processing which may require many computationally intensive matrix operations. Increasing the complexity of the imaging model places greater demands on accuracy. Solutions to such complex systems require better, and hence more complex, algorithms. Most of these algorithms are based on matrix factorisation (decomposition) techniques, of which Singular Value Decomposition (SVD) is the most robust method. The SVD factorisation requires a two-sided transformation which involves several trigonometric operations and rotations ideally suited to dedicated VLSI hardware (CORDIC processing) for real time calculations. CORDIC has also been applied to phase correction in dynamic range focusing, when Digital Baseband Demodulation [5] techniques are employed in Interpolation Beamforming [6]. A complex signal is represented by its in-phase, I, and quadrature, Q, components, and is phase corrected by rotating the complex signal.

Haviland and Tuszynski designed and built a CORDIC processor [7] in 1980 which used an iterative process to calculate circular, linear and hyperbolic functions. A more recent implementation (1993) by Duprat and Muller [8] discusses the possibility of using a redundant number system for the representation of a signed digit.

This report is broken into four logical sections, namely, CORDIC Theory, Hardware Implementations, Improving CORDIC Accuracy and finally a VHDL Implementation.
Chapter 1
The CORDIC Algorithm

Consider a 2-D vector $(x, y)$ represented by a point $v = x + jy$ in the complex plane. If the vector is rotated by an angle $\theta$, the new co-ordinate vector is given by:

$$\tilde{v} = v\, e^{j\theta} \tag{1.1}$$

and shown in Figure (1.1).

Figure 1.1: Rotation of a point in 2-D space. (The point $v = x + jy$ is rotated through the angle $\theta$ to $\tilde{v} = \tilde{x} + j\tilde{y}$.)

The angle $\theta$ can be expanded into a set of elementary angles $\alpha_i$ with pseudo-digits $q_i \in \{-1, +1\}$, and angle expansion error $z_n$, such that

$$\theta = \sum_{i=-1}^{n-1} q_i \alpha_i + z_n \tag{1.2}$$

and the sub-rotation angles $\alpha_i$ take on the following values:

$$\alpha_i = \begin{cases} \pi/2 & \text{for } i = -1 \\ \arctan(2^{-i}) & \text{for } i = 0, 1, \ldots, n-1 \end{cases} \tag{1.3}$$

Note that $\alpha_i$ is approximately equal to but less than $2^{-i}$, and the resulting angular expansion error is therefore $|z_n| \lesssim 2^{-(n-1)}$.
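Substitution of Equation (1.2) into Equation (1.1) gives:

$$\tilde{v} = v \cdot \prod_{i=-1}^{n-1} e^{j q_i \alpha_i} \cdot e^{j z_n} = v \cdot (j q_{-1}) \cdot \prod_{i=0}^{n-1} e^{j q_i \alpha_i} \cdot e^{j z_n} \tag{1.4}$$

and expanding $e^{j q_i \alpha_i}$,

$$e^{j q_i \alpha_i} = \cos q_i \alpha_i + j \sin q_i \alpha_i = \cos q_i \alpha_i \,(1 + j \tan q_i \alpha_i) = \cos \alpha_i \left(1 + j q_i 2^{-i}\right)$$

Finally

$$\tilde{v} = v \cdot \left( \prod_{i=0}^{n-1} \cos \alpha_i \right) \cdot (j q_{-1}) \cdot \left( \prod_{i=0}^{n-1} \left(1 + j q_i 2^{-i}\right) \right) \cdot e^{j z_n} \tag{1.5}$$

The range of rotation angles which can be represented by Equation (1.2) is $\pm\theta_{max}$, where

$$\theta_{max} = \sum_{i=-1}^{n-1} \alpha_i \approx 190^\circ \tag{1.6}$$

and some values of $\alpha_i$ are given in Table (1.1).

If the expected range of rotation angles is $\pm 90^\circ$ then the initial rotation by $90^\circ$, that is, $e^{j q_{-1} \pi/2} = j q_{-1}$, does not have to be performed and the initial rotation is by $\pm 45^\circ$.

The product of cosines in Equation (1.5) is a constant scaling factor, $K_n$. For a given value of $n$ it can be pre-evaluated using Equation (1.7), and values for $n$ up to 15 are listed in Table (1.2).

$$K_n = \prod_{i=0}^{n-1} \cos \alpha_i = \prod_{i=0}^{n-1} \left(1 + 2^{-2i}\right)^{-1/2} = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \frac{1}{4^i}}} \tag{1.7}$$

The basic CORDIC algorithm, which describes rotation of a unity length vector $v = x + jy$ by an angle $\theta$, can be derived from Equation (1.5) using the following initial conditions, where $z_i$ is the accumulated angular residue:

$$v_{-1} = v \cdot K_n, \qquad z_{-1} = \theta$$

and proceeding with $i = -1, 0, \ldots, n-1$:

$$q_i = \begin{cases} -1 & \text{if } z_i < 0 \\ +1 & \text{if } z_i \ge 0 \end{cases} \tag{1.8}$$

$$v_{i+1} = \begin{cases} v_i \cdot j q_i & \text{if } i = -1 \\ v_i \left(1 + j q_i \cdot 2^{-i}\right) & \text{if } i \ge 0 \end{cases} \tag{1.9}$$

$$z_{i+1} = z_i - q_i \alpha_i \tag{1.10}$$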
     i   Angle             Angle (degrees)   16-bit binaries
     0   arctan(2^0)       45.0000           B400 = 101101.0000000000
     1   arctan(2^-1)      26.5651           6A43 = 011010.1001000011
     2   arctan(2^-2)      14.0362           3825 = 001110.0000100101
     3   arctan(2^-3)       7.1250           1C80 = 000111.0010000000
     4   arctan(2^-4)       3.5763           0E40 = 000011.1001000000
     5   arctan(2^-5)       1.7899           0729 = 000001.1100101001
     6   arctan(2^-6)       0.8952           0395 = 000000.1110010101
     7   arctan(2^-7)       0.4476           01CA = 000000.0111001010
     8   arctan(2^-8)       0.2238           00E5 = 000000.0011100101
     9   arctan(2^-9)       0.1119           0073 = 000000.0001110011
    10   arctan(2^-10)      0.0560           0039 = 000000.0000111001
    11   arctan(2^-11)      0.0280           001D = 000000.0000011101
    12   arctan(2^-12)      0.0140           000E = 000000.0000001110
    13   arctan(2^-13)      0.0070           0007 = 000000.0000000111
    14   arctan(2^-14)      0.0035           0004 = 000000.0000000100
    15   arctan(2^-15)      0.0017           0002 = 000000.0000000010
    16   arctan(2^-16)      0.0008           0001 = 000000.0000000001

Table 1.1: Elementary angles of $\alpha_i$

     n   K_n
     0   0.70710678118655
     1   0.63245553203368
     2   0.61357199107790
     3   0.60883391251775
     4   0.60764825625617
     5   0.60735177014130
     6   0.60727764409353
     7   0.60725911229889
     8   0.60725447933256
     9   0.60725332108988
    10   0.60725303152913
    11   0.60725295913894
    12   0.60725294104140
    13   0.60725293651701
    14   0.60725293538591
    15   0.60725293510314

Table 1.2: Various values of $K_n$. The tabulated entry in row n is the product of $\cos \alpha_i$ taken over i = 0, ..., n (for example, the n = 0 entry is cos 45°).
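As a cross-check on Tables (1.1) and (1.2), the following Python sketch recomputes the elementary angles and the scaling factor. The function names and the use of Python are illustrative assumptions and are not part of the original design. The fixed-point helper assumes the split of 6 integer bits and 10 fraction bits that the binary column of Table (1.1) displays, and it rounds to nearest, so it need not reproduce every tabulated word.

```python
import math

def elementary_angles_deg(n):
    """alpha_i = arctan(2^-i) in degrees, for i = 0 .. n-1 (Equation 1.3)."""
    return [math.degrees(math.atan(2.0 ** -i)) for i in range(n)]

def scale_factor(n):
    """K_n as the product over i = 0 .. n-1 of (1 + 4^-i)^(-1/2) (Equation 1.7)."""
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    return k

def to_fixed(angle_deg, frac_bits=10):
    """Quantise an angle in degrees to an integer word (assumed 6.10 format)."""
    return round(angle_deg * (1 << frac_bits))

if __name__ == "__main__":
    for i, a in enumerate(elementary_angles_deg(4)):
        print(i, f"{a:.4f}", f"{to_fixed(a):04X}")
    # Row n of Table (1.2) is the product up to i = n, i.e. scale_factor(n + 1).
    print(f"{scale_factor(16):.14f}")
```

With the product limits of Equation (1.7), `scale_factor(1)` corresponds to the first row of Table (1.2) and `scale_factor(16)` to the last.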
The final rotated vector is $v_n$, with angle expansion error $z_n$:

$$v_n = \tilde{v} \cdot e^{-j z_n} = v \cdot e^{j\theta} \cdot e^{-j z_n} \tag{1.11}$$

$$z_n = \theta - \sum_{i=-1}^{n-1} q_i \alpha_i \tag{1.12}$$

One complex operation on $v_i$ is equivalent to two operations on real numbers. For $i = -1$,

$$x_0 + j y_0 = j q_{-1} (x_{-1} + j y_{-1})$$

Hence

$$x_0 = -q_{-1}\, y_{-1} \tag{1.13}$$

$$y_0 = q_{-1}\, x_{-1} \tag{1.14}$$

For $i = 0, 1, \ldots, n-1$,

$$x_{i+1} + j y_{i+1} = (x_i + j y_i)(1 + j q_i \cdot 2^{-i})$$

Hence

$$x_{i+1} = x_i - q_i \cdot y_i \cdot 2^{-i} \tag{1.15}$$

$$y_{i+1} = y_i + q_i \cdot x_i \cdot 2^{-i} \tag{1.16}$$

The CORDIC algorithm reduces to an iterative set of operations consisting of a binary shift and an accumulator for each of $x$, $y$ and $z$.

Refer to Appendix A for a list of transcendental functions.
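The iteration of Equations (1.8) to (1.16) can be sketched in floating point as follows. This is a minimal model for checking the equations, not the hardware implementation; the function name `cordic_rotate` and the default of 16 stages are assumptions made for illustration.

```python
import math

def cordic_rotate(x, y, theta, n=16):
    """Rotate (x, y) by theta radians (|theta| up to about 190 degrees)."""
    # Pre-scale by K_n (Equation 1.7) so the result has the input magnitude.
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = x * k, y * k, theta

    # Stage i = -1: rotation by +/- 90 degrees (Equations 1.13 and 1.14).
    q = 1 if z >= 0 else -1
    x, y, z = -q * y, q * x, z - q * math.pi / 2

    # Stages i = 0 .. n-1: shift-and-add updates (Equations 1.15, 1.16, 1.10).
    for i in range(n):
        q = 1 if z >= 0 else -1
        x, y = x - q * y * 2.0 ** -i, y + q * x * 2.0 ** -i
        z -= q * math.atan(2.0 ** -i)
    return x, y, z  # z is the angle expansion error z_n

if __name__ == "__main__":
    print(cordic_rotate(1.0, 0.0, math.radians(30.0)))
```

The returned residue stays below $2^{-(n-1)}$ in magnitude, which is the bound stated after Equation (1.3), and the rotated vector differs from the ideal one by roughly that angle.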
Chapter 2
CORDIC Hardware Implementations

A hardware implementation of a CORDIC processor depends on the number of functions required and the computational speed. If all functions are to be computed, then there will be a necessary overhead for selecting each function. However, a small, fast design will result if only a small number of functions is required. This chapter presents possible solutions to a mixture of design problems.

2.1 CORDIC Processor Architecture

A CORDIC algorithm can take on two primary architectures, namely, word-serial or word-parallel. A word-serial processor minimises hardware requirements by utilising a single CORDIC unit repeatedly. However, iterative algorithms which are controlled by a small number of variables can be expanded on a two-dimensional area, i.e., instead of executing a certain set of instructions n times using a single element (e.g., a CORDIC unit), n duplicated elementary cells are used in successive steps of an iteration [9]. This flattened structure can perform many operations in parallel and is called a word-parallel CORDIC processor.

A word-parallel architecture has the advantage of being up to n times faster but, due to the expansion, requires at worst n times more hardware. However, the word-serial architecture requires complex controlling hardware and a variable shifter, decreasing the hardware saving ratio.

2.1.1 A Word-Serial CORDIC Architecture

The CORDIC algorithm has the advantage of not requiring any special hardware other than an accumulator and a variable shifter, which are generally available in most microcontrollers.

A multi-function word-serial CORDIC processor architecture could be realised using a basic micro structure consisting of a two-port register file and a variable shifter combined with an ALU, interconnected by several data paths as shown in Figure (2.1).

Figure 2.1: Generic Processor Architecture. (Blocks: variable shifter producing $2^{-i} y_i$ or $2^{-i} x_i$, ROM of $K_n$ values, ROM of $\alpha_i$ values, register file, controlling micro-code, CC register and ALU; input data buses $x_i$, $y_i$, $z_i$; result bus $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.)

A generic controller could consist of microcode instructions for the ALU and register file, and would execute an iterative algorithm. This structure is similar to that of a microprocessor or DSP and allows many variations of the CORDIC algorithm, as the order of operations and the expanded instruction set increase flexibility. This type of structure illustrates that it would be possible to implement the CORDIC algorithm on any microprocessor or DSP.

Optimising the generic processor structure for a word-serial CORDIC processor is achieved by reducing the functionality to only those operations required by the CORDIC algorithm. A possible word-serial architecture is shown in Figure (2.2), where the ALU now contains three adders and dedicated registers. The microcode controller has been replaced by faster Combinational Control Logic dedicated to the CORDIC operation sequence.

2.1.2 A Word-Parallel CORDIC Architecture

The word-parallel method expands the single-dimensional algorithm into a two-dimensional problem and results in shorter computation times. Greater speeds of computation can be obtained by pipelining between stages, so that many partial results can be calculated in parallel. A pipelined word-parallel architecture is shown in Figure (2.3), where each iteration is represented by a separate CORDIC block and a latch is placed after each iteration, or after several iterations.

The following chapters will develop, implement, and simulate such a parallel CORDIC structure using the VHDL hardware description language.
Figure 2.2: An Optimised Word-Serial CORDIC Architecture. (Blocks: combinational control logic with next-state and select signals, m-bit counter register, q-bit register, n-bit registers with adders for $x_i$, $y_i$ and $z_i$, and a lookup table of $\alpha_i$ values; initial inputs $x_0$, $y_0$, $z_0$; control inputs Clock, Reset, Load and Precision; output Finished flag.)

Figure 2.3: Word-Parallel CORDIC architecture with possible data pipelining. (Cells #0 to #n in cascade; cell #i computes $x_{i+1}$, $y_{i+1}$, $z_{i+1}$ from $x_i$, $y_i$, $z_i$ using $q_i = \mathrm{sign}[z_i]$ and the constant $\alpha_i$; clocked latches for pipelining of data are placed between cells.)
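The difference between the two architectures of Section 2.1 can be sketched with a small behavioural model. This is only an illustration in Python under stated assumptions: the names `cordic_cell`, `word_serial` and `word_parallel_pipeline` are hypothetical, the cells work in floating point, and the $K_n$ pre-scaling and the initial 90 degree stage are left out for brevity.

```python
import math

def cordic_cell(i, x, y, z):
    """One CORDIC stage i >= 0 (Equations 1.15, 1.16 and 1.10)."""
    q = 1 if z >= 0 else -1
    return (x - q * y * 2.0 ** -i,
            y + q * x * 2.0 ** -i,
            z - q * math.atan(2.0 ** -i))

def word_serial(x, y, z, n):
    """A single cell reused n times: n steps per result."""
    for i in range(n):
        x, y, z = cordic_cell(i, x, y, z)
    return x, y, z

def word_parallel_pipeline(inputs, n):
    """n cells with a latch after each cell; one new input per clock."""
    latches = [None] * n
    outputs = []
    # n - 1 extra clocks flush the last inputs through the pipeline.
    for item in list(inputs) + [None] * (n - 1):
        sources = [item] + latches[:-1]      # each cell reads the previous latch
        latches = [None if s is None else cordic_cell(i, *s)
                   for i, s in enumerate(sources)]
        if latches[-1] is not None:
            outputs.append(latches[-1])
    return outputs

if __name__ == "__main__":
    vectors = [(0.6, 0.0, 0.5), (0.3, 0.2, -0.7), (0.1, -0.4, 0.9)]
    print(word_parallel_pipeline(vectors, 8))
```

In this model k input vectors need k times n cell evaluations in sequence for the word-serial form, whereas the pipelined form delivers all k results after k + n - 1 clocks, at the cost of n cells.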
Chapter 3
Improving CORDIC Accuracy

Iterative algorithms calculate results by approximation, so the solution will contain errors. CORDIC is no exception: errors are introduced by a combination of quantisation and approximation. The accuracy of a CORDIC processor depends on the word length used for the three input variables $x$, $y$ and $z$, as well as the number of iterations or steps performed. This chapter describes the errors associated with a fixed point implementation and a means of reducing these errors.

3.1 Estimation of CORDIC Accuracy

The fundamental operation performed by a CORDIC processor is the shift-and-add process, in which fixed point arithmetic will introduce errors. For example, consider the binary scaling of the vector $v_i = (x_i, y_i)$ at the ith stage:

    if i <= m then v_{i+1} is updated with the truncated value v_i * 2^{-i}
    if i > m  then v_{i+1} = v_i, and the update will be 0

where m is the internal bus width of v, which limits the maximum number of useful iterations. Peak accuracy could be achieved after m iterations, since all accuracy has then been exhausted in v. However, truncation errors may exceed the accuracy achieved by more iterations, and it is desirable to find the optimal number of iterations.

The accuracy of the rotation is determined by how closely the input rotation angle is approximated by the summation of sub-rotation angles $\alpha_i$. The error in v after n iterations will be proportional to the error in z. An increase in the z datapath width will increase the accuracy of the z update and hence the v update.

The numerical accuracy of the CORDIC algorithm can be calculated by examining truncation and approximation errors. Truncation errors are due to the finite word length and approximation errors are due to the finite number of iterations. Walther [10] analysed the x and y iterations independently of the z iterations and concluded that log n extra bits in the data paths can provide n bits of accuracy. This work was re-calculated by Kota and Cavallaro [11] in a non-independent manner, concluding that log n + 2 extra bits are required to achieve n bits of accuracy after n iterations.
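The Kota and Cavallaro estimate is simple enough to evaluate directly. The helper below is an illustrative Python sketch; the name `datapath_width` is hypothetical, the logarithm is taken to base 2, and rounding the logarithm up for values of n that are not powers of two is an assumption rather than something stated in the analysis.

```python
import math

def datapath_width(n):
    """Internal width n + log2(n) + 2 for n-bit accuracy after n iterations."""
    return n + math.ceil(math.log2(n)) + 2

if __name__ == "__main__":
    for n in (8, 16):
        print(n, datapath_width(n))
```

For n = 8 and n = 16 this gives 13 and 22 bits.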
Kota and Cavallaro's result represents an upper bound of error in the CORDIC processor. A graph of this function appears in Figure (3.1), from which it can be seen that to achieve 8 or 16 bit accuracy, the internal datapaths need to be 13 and 22 bits wide respectively.

Figure 3.1: Numerical accuracy of the CORDIC processor. (Plot title: Datapath resolution vs Output Resolution. Horizontal axis: internal datapath width, n + log(n) + 2, from 0 to 40. Vertical axis: output resolution of n bits with n iterations, from 0 to 32.)

3.2 The Lower Bound of CORDIC Accuracy

A CORDIC processor can be presented with all possible input combinations to find the lower bound of error. Simulation results are shown in Figure (3.2), where a 12 bit CORDIC processor with a variable number of stages is presented with all possible rotation angles in the range $-\pi \le z_{-1} \le \pi$ and the resulting accuracy in bits is calculated. Kota and Cavallaro's upper bound of error (as defined by their maximum error equation in Appendix (B)) is also shown in Figure (3.2). The upper bound of error has a well defined peak of accuracy; however, the simulation results indicate that accuracy will improve if more iterations are performed.
Figure 3.2: Predicted and Actual accuracy of a CORDIC processor with a 12 bit internal datapath. (Horizontal axis: number of stages n, from 0 to 12. Vertical axis: output accuracy, from 0 to 12. Solid: predicted accuracy. Dashed: actual accuracy.)
Figure (3.3) illustrates, by simulation, the accuracy of a 12 bit, 12 stage processor and the resulting bits of error produced. About 0.3% of results have more than 2 bits of error, which indicates that the error bound of a CORDIC processor lies between the upper and lower bounds of error.

Figure 3.3: A plot showing bits of error for a typical test vector rotated through all possible angles. (Polar plot; radial axis: bits of error, 1 to 3; angular axis in degrees.)

The simulation results indicate that n + log n + 2 is an overestimate of the datapath width required, and that a reduction in datapath width is possible if the number of iterations is increased. Simulation results of two 8 stage CORDIC processors with 12 bit and 8 bit datapaths are shown for comparison in Figure (3.4) and Figure (3.5) respectively. The simulation results were obtained by varying the magnitude of v and the rotation angle in uniform steps. The difference in resolution obtained is two bits, indicating that the lower bound of error is closer to the error bound of CORDIC.

Figure 3.4: A 12 bit, 8 stage CORDIC processor produces 9 bit accurate results. (Polar plot; radial axis from 0.17 to 1.0; angular axis in degrees.)

Figure 3.5: An 8 bit, 8 stage CORDIC processor produces 7 bit accurate results. (Polar plot; radial axis from 0.1 to 1.0; angular axis in degrees.)

3.3 Reducing the z update error

In the rotational mode of CORDIC, the residual angle converges towards zero by adding/subtracting sub-rotation angles, and the final iterations of the $z_i$ update will result in numbers approaching zero. More precisely, the angular error $z_i$ is approximately equal to $2^{-i}$; thus for a bus width m, only (m - i) bits are used to represent the error.

To reduce the $z_i$ error a floating point system could be used, but its complex hardware implementation is not suited to word-parallel structures. A simpler method to improve accuracy, i.e., to utilise all m bits, is a quasi-floating point scheme or normalisation scheme, implemented by scaling the existing sequence by $2^i$, i.e.,

$$\bar{z}_i = 2^i \cdot z_i$$

Therefore, the new sequence becomes

$$\bar{z}_{i+1} = 2^{i+1} \cdot z_{i+1} = 2 \cdot 2^i \cdot (z_i - q_i \alpha_i) = 2 \cdot (2^i \cdot z_i - q_i \cdot 2^i \cdot \alpha_i) = 2(\bar{z}_i - q_i \bar{\alpha}_i) \tag{3.1}$$

which requires a shift left at each iteration, and requires no extra hardware for a word-parallel structure. A new sequence of sub-rotation angles can be defined as:

$$\bar{\alpha}_i = 2^i \alpha_i = 2^i \arctan(2^{-i}) \tag{3.2}$$

where $\bar{\alpha}_i$ approaches a finite value of 1 for increasing values of i, and will utilise most of the bus width. Since the scaling system results in full use of the databus width, overflow may occur if the bus width is too small. Using Equation (3.1), $\bar{z}_{i+1}$ takes its maximum value when $\bar{z}_i$ approaches zero, giving

$$\max[\bar{z}_{i+1}] \approx 2 \cdot \max[\bar{\alpha}_i] \tag{3.3}$$

Calculating the increase in accuracy is beyond the scope of this report; however, simulation indicates that there is a direct improvement in accuracy. The simulation results indicated that, using the traditional scheme, the accuracy of the rotation is

$$\text{accuracy} \propto \log(z_i \text{ datapath width}) + \log(\text{number of stages}) \tag{3.4}$$

whereas the normalisation scheme has the advantage of

$$\text{accuracy} \propto \log(\text{number of stages}) \tag{3.5}$$

since the z datapath is always in a semi-normalised state.

Using the traditional scheme, $\alpha_i \to 0$, limiting the number of useful stages. When normalised, however, there is no limit on the number of stages, and a significant reduction in hardware is possible by reducing the bus width of z.

Figure (3.6) illustrates the error dependencies on the number of stages and bits for the scaled and unscaled CORDIC processors. Figure (3.6(a)) and Figure (3.6(b)) show the angular expansion error. Figure (3.6(c)) and Figure (3.6(d)) show the dependence of v error on the angular expansion error.
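The normalisation of Equations (3.1) and (3.2) can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, angles are in radians, and the ROM word helper simply rounds to a chosen number of fraction bits. It runs the traditional and the normalised z updates side by side and shows how many significant bits a stored angle constant retains under each scheme.

```python
import math

def alpha(i):
    """Traditional sub-rotation angle alpha_i = arctan(2^-i), in radians."""
    return math.atan(2.0 ** -i)

def alpha_bar(i):
    """Normalised angle 2^i * alpha_i, which tends to 1 (Equation 3.2)."""
    return (2.0 ** i) * alpha(i)

def residues(theta, n):
    """Traditional z_i and normalised z-bar_i for i = 0 .. n."""
    z, z_bar = theta, theta                  # z-bar_0 = 2^0 * z_0
    history = [(z, z_bar)]
    for i in range(n):
        q = 1 if z >= 0 else -1
        z = z - q * alpha(i)                         # Equation (1.10)
        z_bar = 2.0 * (z_bar - q * alpha_bar(i))     # Equation (3.1)
        history.append((z, z_bar))
    return history

def rom_word(value, frac_bits):
    """Quantise a constant to an integer word with frac_bits fraction bits."""
    return round(value * (1 << frac_bits))

if __name__ == "__main__":
    for i, (z, z_bar) in enumerate(residues(0.5, 12)):
        print(i, f"{z:+.6f}", f"{z_bar:+.6f}")
    print(rom_word(alpha(9), 10), rom_word(alpha_bar(9), 10))
```

With 10 fraction bits the constant for stage 9 is stored as the word 2 in the traditional scheme but as 1024 in the normalised scheme, which is the sense in which the scaled sequence uses most of the bus width.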
Figure 3.6: Simulation results from a CORDIC processor illustrating the effects of the normalisation scheme. (Four surface plots: angle expansion error, scale x 10^-3, against bits and stages, with alpha scaling and with no alpha scaling; relative v error against bits in z and stages/bits in v, with no alpha scaling and with alpha scaling.)
3.4 Unexpected Truncation Errors

Using fixed point arithmetic in a CORDIC processor will introduce an unexpected truncation error. The error occurs when the vector $(x, y)$ has a negative component. Consider the final iterations, where the update of vector v approaches 0 since a larger number of right shifts is performed at each iteration. However, this is not the case if x or y is negative. For example, let $x_{i \to N}$ equal the hexadecimal number X"2D", or positive 45. The right shifted value of $x_{i \to N}$ approaches zero. However, the negative of X"2D" in twos-complement form is X"D3", and its right shifted value approaches X"FF", or -1, not the expected zero.

This is a significant problem in the CORDIC processor, since the addition of extra iterations will only increase the error. A simple method of removing this error is to round the shifted value instead of truncating it. A simple method for rounding values is to add the bit that was last shifted out to the shifted value.

The rounder could be implemented using a half-adder and typically requires three logic gates per bit to implement. Minimal extra hardware is required in the word-serial architecture; however, a word-parallel structure requires two half-adders per stage. The additional delay will have a direct effect on the performance of the processor.

Figure (3.7) shows the simulation results of two CORDIC processors, with and without rounding units. The test vector was rotated in steps of 5° through 360°, and the rounded results are significantly more accurate. The rounding maintains monotonicity in the actual angle of rotation as well as uniform magnitude.

Figure 3.7: An 8 bit, 8 stage CORDIC processor (a) without rounding, (b) with rounding. (Two polar plots; radial scale 32.95; angular axis in degrees.)
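The truncation effect of Section 3.4 and the proposed rounding can be reproduced with integer arithmetic. The sketch below is an illustration in Python, whose `>>` operator on negative integers is an arithmetic (sign-extending) shift; the function names are hypothetical, and the rounding rule is the one described in the text, namely adding the last bit shifted out.

```python
def shift_truncate(x, k):
    """Arithmetic right shift by k places, as a two's-complement shifter does."""
    return x >> k

def shift_round(x, k):
    """Right shift by k >= 1 places, then add the last bit that was shifted out."""
    last_bit_out = (x >> (k - 1)) & 1
    return (x >> k) + last_bit_out

if __name__ == "__main__":
    for value in (0x2D, -0x2D):
        print(value, [shift_truncate(value, k) for k in range(1, 9)],
              [shift_round(value, k) for k in range(1, 9)])
```

For shifts of seven places or more, +45 truncates to 0 while -45 truncates to -1; with the rounding bit added, both give 0.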
Chapter 4
VHDL Implementation

Various tools can be used to implement the CORDIC processor; however, a standardised approach to this problem would unify the solution for further development in various applications. VHDL (VHSIC Hardware Description Language) has been used here to describe the structural and behavioural characteristics of a Word-Parallel CORDIC processor. VHDL has become the standard among hardware description languages and has its own IEEE standard [12].

4.1 The Basic CORDIC Unit

Any CORDIC structure will involve a basic unit containing three adders/subtracters, as shown in Figure (4.1). The binary scaler would be variable in the case of a Word-Serial device; however, it is much simpler in the Word-Parallel device, as a shift translates directly to a misalignment of the data bus.

Figure 4.1: The basic CORDIC unit. (Cell i with inputs $x_i$, $y_i$, $z_i$ and $\alpha_i$, and outputs $x_{i+1}$, $y_{i+1}$, $z_{i+1}$.)

This unit, together with a suitable FSM and registers, could form a word-serial structure. A word-parallel implementation can be obtained by linking n CORDIC units.

The rest of this chapter deals with the development of a Word-Parallel unit and the interconnection of these devices using the VHDL language. It should be a relatively trivial task, but unfortunately there are many bugs in the Viewlogic VHDL Synthesiser, which also supports only a subset of the full VHDL standard.

The main aim of the project was to describe a CORDIC processor using the VHDL language and to allow the application designer to change the size of the structure easily. This flexibility could include fundamental changes such as variable datapath widths and a variable number of stages. Other options such as rounding intermediate nodes and pipelining could also be easily integrated.

Currently, Viewlogic's VHDL is a partial implementation of the 1987 IEEE Standard VHDL, and many constructs are missing from their implementation. Most of the useful constructs are there, but they are accompanied by ambiguous messages stating that the construct works only partially. This made the tool very difficult to work with.

4.2 VHDL Describes Structure and Behaviour

VHDL has the ability to describe a design in two ways:

  - in terms of its component structure,
  - in terms of behavioural functionality of the design

and also offers the possibility of integrating the two streams. A requirement for structural descriptions is that the lowest level description will be a behavioural description, to ensure portability between different synthesis libraries. An example of a lowest level operator is the logical operator AND (behavioural), used to describe the ANDing of two operands. This may be synthesised as an AND standard cell from the library. Consequently, a component cannot be accessed directly from a cell library, since doing so would limit portability.

Consider a slightly more complex design of an n-bit adder/subtracter, which could be described by the following behavioural description:

    addsub : PROCESS(a, b, sel)
      VARIABLE res : VLBIT_VECTOR(n DOWNTO 0);
    BEGIN
      res := zero(n DOWNTO 0);      -- needs to be initialised
      IF sel = '1' THEN
        res := add2c(a, b);
      ELSE
        res := sub2c(a, b);
      END IF;
      s <= res(n-1 DOWNTO 0);       -- discard cout
    END PROCESS;

The process activates when one of the signals in the sensitivity list changes, and then produces a result in the internal variable res. The signal s is assigned the lower portion of the sum. Now consider a structural description of the same adder/subtracter, where several components are used:

    c(0) <= sel;                    -- carry in
    connect : FOR i IN 0 TO n-1 GENERATE
      invert      : invf101 PORT MAP( b(i), b_bar(i) );
      mux_b_b_bar : muxf201 PORT MAP( b_bar(i), b(i), sel, b_hat(i) );
      addsub      : faf001  PORT MAP( a(i), b_hat(i), c(i), s(i), c(i+1) );
    END GENERATE;

Note that the muxf201 component is used to select between the non-inverted and inverted signals of the b bus. The components are user defined entities describing the appropriate logic gates. For example, a fragment of the faf001 component contains the following lowest level behavioural description:

    SUM <= A1 xor B1 xor CIN2;
    CO  <= (A1 and B1) or (A1 and CIN2) or (B1 and CIN2);

It is not immediately obvious which way a designer should describe a particular design; however, the next section reveals the results of the synthesiser, on which a decision may be based. In general, the easier it is for a designer to write a design in VHDL, the more optimisation the synthesiser needs to perform.

4.2.1 Hierarchical vs Flat Designs

One of the very useful features of Viewlogic's VHDL Synthesiser [13] is the ability to create either a hierarchical (top-down) or a flat (bottom-up) design. A hierarchical design allows the engineer to see lower level interconnections between design units, unlike the flat design where no (or little) hierarchy can be seen. This allows easier debugging of designs; however, it has the disadvantage of being less efficient than a flat design, which combines all the design elements together into one circuit and then performs optimisation.

Figure (4.2) illustrates the previous structural design of the Adder/Subtracter, where it can be observed that the schematic consists of higher level components than standard library cells. This feature of Viewlogic VHDL enables easy debugging of high level components when compared to a flat design.
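Returning to the structural adder/subtracter listed earlier in this section, its bit-level behaviour can be checked with a short software model. The Python sketch below is illustrative and is not the synthesised design. It assumes that sel = 1 selects subtraction, because the structural listing ties the carry-in to sel; the polarity of the muxf201 select input is not stated in the text, and the behavioural listing uses sel = '1' for addition, so this convention is an assumption.

```python
def full_adder(a, b, cin):
    """One-bit full adder using the sum and carry equations of the faf001 fragment."""
    s = a ^ b ^ cin
    co = (a & b) | (a & cin) | (b & cin)
    return s, co

def add_sub(a, b, sel, n=4):
    """n-bit ripple adder/subtracter; ASSUMPTION: sel = 1 computes a - b."""
    carry = sel                       # c(0) <= sel
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        b_hat = b_i ^ sel             # choose b(i) or its inverse
        s_i, carry = full_adder(a_i, b_hat, carry)
        result |= s_i << i
    return result                     # carry out discarded: result is modulo 2^n

if __name__ == "__main__":
    print(add_sub(5, 3, 0), add_sub(5, 3, 1), add_sub(3, 5, 1))
```

Results are returned as unsigned n-bit patterns, so a negative difference appears in two's-complement form.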
It is relatively simple to navigate between levels in a hierarchical design.

Figure 4.2: A Hierarchical Design of the Adder/Subtracter for n = 4. (Schematic of four bit slices, each built from INVF101, MUXF201 and FAF001 components, with inputs A0 to A3, B0 to B3 and SEL, and outputs S0 to S3.)

Most libraries contain standard cells for full adders, muxes and inverters but, because VHDL does not allow direct access to library cells, these components had to be described behaviourally. A mux simply maps to an IF statement; however, no behavioural description will map to the full adder cell, so the description stated previously must be used.

Compiling the same design using the flat (bottom-up) design approach, the synthesiser produces the following statistics when using, for example, the X2000 library. The schematic generated by the synthesiser is shown in Figure (4.3).

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:NAND2      15       0.25    X2000:OR2         3       0.25
    X2000:XOR2       15       0.25
    ----------------------------------------------------------------
    Total Cells : 33                  Total Area : 8.25
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 14       Total number of nets = 42
Figure 4.3: A Flat Design of the Adder/Subtracter for n = 4. (Schematic built from XOR2, NAND2 and OR2 gates, with inputs A0 to A3, B0 to B3 and SEL, and outputs S0 to S3.)

Reconsidering the behavioural description of the Adder/Subtracter and synthesising the design, the following statistics are generated; the corresponding schematic is shown in Figure (4.4).

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell          Count  Area/Cell    Cell          Count  Area/Cell
    ----------------------------------------------------------------
    X2000:AND2       21       0.25    X2000:AND3        1       0.50
    X2000:INV        11       0.00    X2000:NAND2       8       0.25
    X2000:OR2        17       0.25    X2000:XOR2        3       0.25
    ----------------------------------------------------------------
    Total Cells : 61                  Total Area : 12.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 11       Total number of nets = 70
Figure 4.4: A Behavioural Design of the Adder/Subtracter for n = 4.
[Schematic: flat netlist of AND2, AND3, INV, NAND2, OR2 and XOR2 gates; inputs A0-A3, B0-B3 and SEL, outputs S0-S3.]

From the statistics of each design, it is important to note that the total area and the maximum level of gates differ. The structural description produces a small but slow design (area 8.25, 14 gate levels), whereas the behavioural description produces a fast but large design (area 12.75, 11 gate levels).

A characteristic of the synthesiser is that it maps a behavioural description to a structure by representing each output in terms of its inputs, much like a lookup table, which removes any structure. For a structural description, the synthesiser performs logic level optimisation on the given structure, thus producing a design with less logic.

4.2.2 The Viewlogic Synthesiser

The Viewlogic Synthesiser has the ability to alter the emphasis on speed or area when optimizing a design. The statistics generated in the previous section were area optimized,
and neglected the effect of gate delays. For example, optimizing the behavioural design for speed, the synthesiser generates 14 more gates than before (75 instead of 61); however, there is a significant decrease in the maximum level of gates (from 11 to 9):

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell           Count  Area/Cell     Cell           Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:AND2        10       0.25     X2000:AND3         2       0.50
    X2000:AND4         1       0.75     X2000:INV         15       0.00
    X2000:NAND2       17       0.25     X2000:NAND3        1       0.50
    X2000:NAND4        1       0.75     X2000:NOR3         2       0.50
    X2000:NOR4         2       0.75     X2000:OR2         22       0.25
    X2000:OR4          1       0.75     X2000:XOR2         1       0.25
    ----------------------------------------------------------------------------
    Total Cells : 75                    Total Area : 18.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 9          Total number of nets = 84

The synthesiser can optimise small designs, but when the design grows large, the memory and processing power required to optimise it are considerable. The design of the CORDIC unit contains three adders/subtracters and takes several minutes to compile and optimise. However, when this unit is integrated into a larger design of several units, the compiler has many problems and eventually crashes after half an hour of compilation.

A solution to this optimisation problem is to use a hierarchical flow and describe the components using behavioural or structural descriptions. Using this method the compiler knows nothing about large components and cannot perform any global optimisation.
This is not a fully optimized solution, but it is currently the best available. However, it is possible to flatten the design below the top level, making the design slightly more efficient.

4.3 VHDL Design of the CORDIC Unit

The first stage of the design of a CORDIC processor is to create the CORDIC unit, for which two approaches can be taken: a behavioural description or a structural description. Firstly, consider the following behavioural description, where the shifting of $(x_i, y_i)$ is done external to the CORDIC unit in the top level design. This approach is optimal, since it only requires a misalignment of the data buses in the top level interconnections.

If instead the shifting were contained inside the CORDIC unit, each unit would require a variable shifter and could not be optimized using the current version of Viewlogic VHDL, for the reasons discussed previously. Another reason why shifting is done external to the CORDIC unit
is that the LOOP variable inside the generate statement cannot be passed to any user defined function, procedure or entity. This is not stated in the manual, and the problem took many days to identify.

The behavioural description is as follows:

    ARCHITECTURE behaviour OF adder IS
    begin
      cell_i : process (xi,xs,yi,ys,zi,ai)
        VARIABLE x_res: vlbit_vector(n downto 0); -- temporary results
        VARIABLE y_res: vlbit_vector(n downto 0);
        VARIABLE z_res: vlbit_vector(k downto 0);
      begin
        x_res := zero(n downto 0); -- initialise, otherwise the compiler complains
        y_res := zero(n downto 0);
        z_res := zero(k downto 0);
        if zi(k-1) = '0' then -- z_i is positive
          x_res := add2c (xi, ys);
          y_res := sub2c (yi, xs);
          z_res := sub2c (zi, ai);
        else -- z_i is negative
          x_res := sub2c (xi, ys);
          y_res := add2c (yi, xs);
          z_res := add2c (zi, ai);
        end if;
        xip1 <= x_res (n-1 downto 0);
        yip1 <= y_res (n-1 downto 0);
        zip1 <= z_res (e-1 downto 0);
      end process;
    END behaviour;

The synthesiser generates the following statistics for an 8 bit version of the code. The maximum level of gates is 20, since each bit requires 2 levels (16 for 8 bits), plus additional gates for the multiplexer and inversion.

    *********************************************
    Gate Usage Summary
    *********************************************
    Cell           Count  Area/Cell     Cell           Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:AND2       159       0.25     X2000:AND3         3       0.50
    X2000:INV         69       0.00     X2000:NAND2       76       0.25
    X2000:OR2        125       0.25     X2000:XOR2         7       0.25
    ----------------------------------------------------------------------------
    Total Cells : 439                   Total Area : 93.25
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 20         Total number of nets = 487

The structural description of the CORDIC unit is slightly more complex and is best represented pictorially, as shown in Figure (4.5). Each box in the figure represents a different VHDL entity (component), and some components are used more than once. The design is very bulky, and it is easier to make mistakes.

Figure 4.5: The structure of the CORDIC unit showing the various entities.
[Block diagram: adders.vhd contains two addsub_n.vhd units and one addsub_e.vhd unit, each built from faf001.vhd (full adder), muxf201.vhd (2-to-1 mux) and inv101.vhd (inverter); inputs xi, xs, yi, ys, zi, ai; outputs xip1, yip1, zip1.]

It achieves the same functionality as the behavioural description but requires a lot more effort to make sure all the connections are correct. As stated previously, the structural design will minimise area, but will result in a slower design, as reflected by the following synthesiser statistics.
    *********************************************
    Gate Usage Summary
    *********************************************
    Cell           Count  Area/Cell     Cell           Count  Area/Cell
    ----------------------------------------------------------------------------
    X2000:INV          3       0.00     X2000:NAND2      139       0.25
    X2000:OR2         41       0.25     X2000:XOR2        75       0.25
    ----------------------------------------------------------------------------
    Total Cells : 258                   Total Area : 63.75
    *********************************************
    Netlist Statistics
    *********************************************
    Maximum level of gates = 31         Total number of nets = 306

Using the structural design will save about 30% on area (63.75 against 93.25) but will execute about 50% slower (31 gate levels against 20). In an FPGA implementation, speed might be more desirable than area optimisation, since the devices operate relatively slowly when compared to a custom VLSI device. The extra area of the behavioural design will be a relatively small concern.

4.3.1 The Rounding Unit

The rounding unit is formed by the interconnection of n half adders or, in behavioural terms, by the addition of the bit shifted out during the shifting process. Describing it structurally involves using the inc001 component, which contains an AND and an XOR gate to form a half adder. The interconnection of the inc001 components is:

    c(0) <= cin; -- first carry
    connect: for i in 0 to n-1 generate
      addsub: inc001 port map( a(i), c(i), s(i), c(i+1) );
    end generate;

Alternatively, a much simpler behavioural description is created using the unsigned addition routine addum. This avoids the sign extension used in the add2c routine.

    rounder : process (a,cin)
      VARIABLE res: vlbit_vector(n downto 0); -- temporary results
    begin
      res := zero(n downto 0); -- initialise, otherwise the compiler complains
      res := addum(a,cin);     -- use addum instead of add2c, as add2c sign-
                               -- extends the cin input, making it -1 not +1
      s <= res (n-1 downto 0);
    end process;
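The behaviour of the rounding unit can be illustrated with a small bit-level model. The Python sketch below chains n half adders, as in the structural description, and uses the chain to add the last bit shifted out to an arithmetically shifted value, which turns truncation into rounding to the nearest integer. The helper names (`half_adder`, `round_unit`, `to_signed`, `shift_round`) are illustrative and do not appear in the report.

```python
def half_adder(a, c):
    """inc001-like cell: XOR gives the sum bit, AND gives the carry."""
    return a ^ c, a & c


def round_unit(a, cin, n=8):
    """Chain of n half adders: adds the single bit cin to the n-bit pattern a.

    The final carry is dropped, like taking res(n-1 downto 0).
    """
    carry, s = cin, 0
    for i in range(n):
        bit, carry = half_adder((a >> i) & 1, carry)
        s |= bit << i
    return s


def to_signed(v, n=8):
    """Interpret an n-bit pattern as a two's-complement integer."""
    return v - (1 << n) if v & (1 << (n - 1)) else v


def shift_round(v, shift, n=8):
    """Arithmetic right shift of an n-bit pattern, then add the last bit shifted out."""
    signed = to_signed(v, n)
    truncated = (signed >> shift) & ((1 << n) - 1)   # plain shift truncates
    cin = (signed >> (shift - 1)) & 1 if shift > 0 else 0
    return round_unit(truncated, cin, n)


if __name__ == "__main__":
    v = 0b00000111                               # 7; 7 / 4 = 1.75
    print(to_signed(v >> 2))                     # truncated result
    print(to_signed(shift_round(v, 2)))          # rounded result
    # A 1-bit two's-complement '1' is -1 once sign extended, which is why a
    # sign-extending adder would subtract instead of add the rounding bit.
    print(to_signed(1, n=1))
```

The last line of the usage example reproduces the pitfall noted in the listing comment: treating the single carry-in bit as a signed quantity turns +1 into -1.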
4.4 Combining the CORDIC Units

The process of combining the CORDIC and Rounding units involves writing the top level design of the hierarchical solution. As before with structural descriptions, the generate statement is used; it allows iterative or conditional generation of a portion of the description.

The first definition to be made in the top level file is that of the $\alpha_i$ constants; this version implements the Alpha Normalisation Scheme. Next, the x, y, z intermediate signals between CORDIC units are shifted by the appropriate amount. The function shift_all is defined in another file, which contains the user defined functions. This operation is required here because it will not work inside the generate statement: concurrent procedure calls only execute when a signal in the sensitivity list changes state, and a change in the shift value is not recognisable inside the generate statement.

    -- Scaled a_i * 2^i values are decimal 45 53 56 57 57 57 57 57
    ai <= X"39_39_39_39_39_38_35_2D";
    sh_x: xis <= shift_all(xi); -- shift intermediate signals
    sh_y: yis <= shift_all(yi);
    sh_z: zis <= shift_z(zi);

It should be noted that xis, yis, zis, xi, yi, and zi are large vectors, each containing several smaller vectors. This system had to be used since Viewlogic's VHDL cannot handle two-dimensional arrays of vlbit.
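The scaled constants quoted in the comment of the listing above can be reproduced numerically. The Python sketch below assumes that one least significant bit of the angle bus represents one degree at stage 0 (consistent with the value 90 used for the init stage) and that stage 0 occupies the least significant byte of the packed constant; the function names are illustrative.

```python
import math


def alpha_constants(stages=8):
    """Normalised angles round(atan(2^-i) * 2^i) in degrees, for i = 0 .. stages-1."""
    return [round(math.degrees(math.atan(2.0 ** -i)) * 2 ** i)
            for i in range(stages)]


def pack(values, width=8):
    """Pack the per-stage constants into one word, stage 0 in the lowest field."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * width)
    return word


consts = alpha_constants()
print(consts)                   # [45, 53, 56, 57, 57, 57, 57, 57]
print(f"{pack(consts):016X}")   # 393939393938352D
```

Because atan(2^-i) * 2^i tends to one radian (about 57.3 degrees), the normalised constants settle at 57 after a few stages, so every stage uses almost the full width of the z bus.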
The shifting of intermediate signals is done by the following function:

    FUNCTION shift_all (x : vlbit_vector (n*(k-1)-1 downto 0))
      RETURN vlbit_vector IS
      VARIABLE x_s : vlbit_vector(n*(k-1)-1 downto 0) := zero(n*(k-1)-1 downto 0);
    BEGIN
      x_s(1*n-1 downto 0)   := shiftr2c(x( 1*n-1 downto 0 ),1);   -- 2 stage
      x_s(2*n-1 downto 1*n) := shiftr2c(x( 2*n-1 downto 1*n ),2); -- 3 stage
      x_s(3*n-1 downto 2*n) := shiftr2c(x( 3*n-1 downto 2*n ),3); -- 4 stage
      x_s(4*n-1 downto 3*n) := shiftr2c(x( 4*n-1 downto 3*n ),4); -- 5 stage
      x_s(5*n-1 downto 4*n) := shiftr2c(x( 5*n-1 downto 4*n ),5); -- 6 stage
      x_s(6*n-1 downto 5*n) := shiftr2c(x( 6*n-1 downto 5*n ),6); -- 7 stage
      x_s(7*n-1 downto 6*n) := shiftr2c(x( 7*n-1 downto 6*n ),7); -- 8 stage
      x_s(8*n-1 downto 7*n) := shiftr2c(x( 8*n-1 downto 7*n ),8); -- 9 stage
      x_s(9*n-1 downto 8*n) := shiftr2c(x( 9*n-1 downto 8*n ),9); -- 10 stage
      return x_s;
    END shift_all;

Next comes the connection of the init component, which is used to expand the convergence range of the CORDIC processor to −190° < z < 190°. The input signals x_in, y_in, z_in are connected to a unit similar to the CORDIC unit, except that an extra bit is appended to the alpha bus to account for the expanded convergence range.
    initial: init port map(
      xi   => X"00",
      xs   => x_in,
      yi   => X"00",
      ys   => y_in,
      zi   => z_in,
      ai   => B"0_0101_1010", -- add/sub 90 degrees
      xip1 => xinit,          -- xinit = 0 +- yin
      yip1 => yinit,          -- yinit = 0 -+ xin
      zip1 => zinit );

The following code has been compressed to reduce detail; however, it can be seen that there are three separate stages: initial connection, intermediate connections, and final connection. This can be seen in Figure (4.6). (Also not shown is the conditional generation of components, e.g., selection of behavioural or structural components, rounding units, etc.)

    connect: for i in 0 to k-1 generate -- k stages
      ls_unit: if i=0 generate
        first_unit: adder port map( ... );
      end generate ls_unit;
      i_unit: if i>0 and i<k-1 generate
        x_round: round port map ( ... );
        y_round: round port map ( ... );
        middle_units: adder port map( ... );
      end generate i_unit;
      ms_unit: if i=k-1 generate
        x_round_last: round port map ( ... );
        y_round_last: round port map ( ... );
        last_unit: adder port map( ... );
      end generate ms_unit;
    end generate connect;

The contents of ... are similar to the port map of the init component.

4.4.1 A Solution

This represents a solution to the CORDIC problem and is close to an optimised solution, but due to compiler and language difficulties a completely optimised solution is not possible. Under these circumstances the design has been optimised as far as possible.

There are many choices to be made about the design of the CORDIC unit, chiefly whether it is going to be area or speed efficient.
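The overall data flow of the word-parallel processor (init stage followed by a chain of CORDIC units with externally shifted inputs) can be modelled in a few lines of Python. This is a simplified sketch, not the report's implementation: x and y are Python integers with truncating arithmetic shifts, z is kept as a floating-point angle in degrees rather than a normalised fixed-point bus, the rounding units are omitted, and the names `cordic_unit`, `cordic_processor` and `gain` are illustrative. The add/subtract pattern follows the behavioural listing of Section 4.3, which rotates the input vector clockwise by z.

```python
import math


def cordic_unit(x, y, z, xs, ys, alpha):
    """One CORDIC unit: xs and ys are the already shifted inputs."""
    if z >= 0:                                   # sign bit of z clear
        return x + ys, y - xs, z - alpha
    return x - ys, y + xs, z + alpha


def cordic_processor(x_in, y_in, z_in, stages=8):
    """init stage (add/sub 90 degrees) followed by `stages` CORDIC units."""
    # init unit: xi = yi = 0, xs = x_in, ys = y_in, alpha = 90 degrees
    x, y, z = cordic_unit(0, 0, z_in, x_in, y_in, 90.0)
    for i in range(stages):
        alpha = math.degrees(math.atan(2.0 ** -i))
        # shifting is done outside the unit, as in the top level design
        x, y, z = cordic_unit(x, y, z, x >> i, y >> i, alpha)
    return x, y, z


def gain(stages=8):
    """Accumulated CORDIC gain of the chain (the init stage has unit gain)."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(stages))


if __name__ == "__main__":
    x, y, z = cordic_processor(4096, 0, 150.0)
    print(x, y, z)
    print(gain() * 4096 * math.cos(math.radians(150.0)),
          -gain() * 4096 * math.sin(math.radians(150.0)))
```

The usage example feeds an angle of 150 degrees, which lies outside the range of the plain CORDIC chain but inside the range once the init stage is included; the second print shows the ideal values for comparison.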
Figure 4.6: The top level schematic of a 4 stage CORDIC processor with Increased Convergence Range and Rounding components.
[Schematic: one INIT unit followed by four ADDER units and six ROUND units; ports X_IN0-7, Y_IN0-7, A_IN0-8, X_OUT0-7, Y_OUT0-7, A_OUT0-7.]
The user can flexibly change the characteristics of the CORDIC processor by changing the value of a few constants, to achieve more or less accuracy as well as different hardware configurations.

Some of the CORDIC hardware statistics generated are:

    Type          Internal Bus Width   Stages   Rounding   Number of Gates
    Behavioural   12 bit                8       no         5841
    Behavioural   12 bit               10       no         7139
    Behavioural   12 bit               12       no         8437
    Behavioural    8 bit                8       no         3753
    Behavioural    8 bit                8       yes        5060
    Structural     8 bit                8       no         2313
    Structural     8 bit                8       yes        2775

Table 4.1: Some CORDIC hardware statistics.

(Remember that there is also one additional stage for increasing the convergence range.)

4.5 Improvements

There is one main improvement which could be made to the current design, which is to include pipelining registers between stages. If a Xilinx FPGA were used, no extra hardware for latches would be necessary, as each cell contains a latch.

Another possibility is to design a Word-Serial CORDIC architecture around the already designed CORDIC unit. This would only require an FSM driver along with some additional hardware for the variable shifter.
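To make the word-serial suggestion more concrete, the following Python sketch models a single CORDIC unit that is reused on every clock cycle, driven by a small state machine with an iteration counter, a variable shifter and a table of angles. It is only a behavioural illustration under several assumptions: the class name `WordSerialCordic`, its state names and its methods are hypothetical, z is a floating-point angle in degrees, and the init stage for the extended convergence range is left out, so the input angle is assumed to lie within about ±99.9°.

```python
import math


class WordSerialCordic:
    """Clocked model: one add/subtract unit reused for every iteration."""

    def __init__(self, stages=8):
        self.stages = stages
        # angle table, one entry per iteration (a ROM in hardware)
        self.alpha_rom = [math.degrees(math.atan(2.0 ** -i))
                          for i in range(stages)]
        self.state = "IDLE"
        self.x = self.y = 0
        self.z = 0.0
        self.count = 0

    def load(self, x, y, z):
        """Latch the operands and start iterating."""
        self.x, self.y, self.z = x, y, z
        self.count = 0
        self.state = "ITERATE"

    def clock(self):
        """Advance the state machine by one clock cycle."""
        if self.state != "ITERATE":
            return
        i = self.count
        xs, ys = self.x >> i, self.y >> i        # variable shifter
        alpha = self.alpha_rom[i]
        if self.z >= 0:
            self.x, self.y, self.z = self.x + ys, self.y - xs, self.z - alpha
        else:
            self.x, self.y, self.z = self.x - ys, self.y + xs, self.z + alpha
        self.count += 1
        if self.count == self.stages:
            self.state = "DONE"


if __name__ == "__main__":
    unit = WordSerialCordic(stages=8)
    unit.load(4096, 0, 30.0)
    cycles = 0
    while unit.state != "DONE":
        unit.clock()
        cycles += 1
    print(cycles, unit.x, unit.y, unit.z)
```

The trade-off against the word-parallel design is the usual one: a single unit and a shifter replace the chain of units, at the cost of one clock cycle per iteration.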
Conclusion

The theory behind the CORDIC algorithm has been covered in detail and its possible applications in array imaging discussed.

It was shown how Kota predicted the upper bound on CORDIC errors; however, simulations reveal CORDIC to be significantly more accurate. A normalisation scheme on the z datapath was introduced to maximise bus usage, and hence increase accuracy. This scheme can reduce the bus width required for z and still achieve the same or greater accuracy. Also observed were the unexpected truncation errors introduced by the twos-complement binary format, which were minimised using a half adder to perform a rounding operation. A method for increasing the convergence range to between −180° and 180° in one extra iteration was also introduced.

Various CORDIC architectures were discussed and a word-parallel architecture described using the VHDL hardware description language. A few design issues and implementation problems were discussed, and it was concluded that a hierarchical design flow had to be followed to avoid compiler memory problems. The solution is not completely optimal, but the resulting design could easily be implemented on an FPGA.

The VHDL design of the CORDIC processor could now be easily integrated into any application, as alterations in the configuration of the CORDIC processor can easily be achieved using the VHDL language.
Appendix A

CORDIC Functions

The functional results from a CORDIC processor are derived from the initialisation of the three input variables x, y, and z, and the subsequent mode of operation selected. Equations (1.15, 1.16) can be rewritten into a general solution, from which six modes of operation are possible, with the introduction of a mode variable m:

$$x_{i+1} = x_i + m \cdot q_i y_i \cdot 2^{-i} \tag{A.1}$$

$$y_{i+1} = y_i - q_i x_i \cdot 2^{-i} \tag{A.2}$$

$$z_{i+1} = z_i - q_i \alpha_i \tag{A.3}$$

where $\alpha_i$ depends on m as follows:

$$\alpha_i = \begin{cases} \operatorname{atanh}(2^{-i}) & \text{if } m = -1 \\ 2^{-i} & \text{if } m = 0 \\ \arctan(2^{-i}) & \text{if } m = +1 \end{cases} \tag{A.4}$$

The three values of m determine the class of function being evaluated: linear (m = 0), circular (m = +1) or hyperbolic (m = −1). The values of $\alpha_i$ were previously given for the circular functions mode, and a similar table could be given for the hyperbolic functions.

The six modes of operation exist because there are two sub-classes available, depending upon whether the iterations seek to drive the variable y or z towards zero. Table (A.1) summarises all of the functions available with this configuration.
                 Hyperbolic (m = −1)                           Linear (m = 0)               Circular (m = +1)
    Iterations   i = {0, 1, 2, ..., N−1}                       i = {0, 1, 2, ..., N−1}      i = {0, 1, 2, ..., N−1}
                 (repeat for i = {4, 13, 40, ...})

    Mode z → 0
      q_i        +1 if z_i < 0                                 +1 if z_i < 0                +1 if z_i < 0
                 −1 if z_i ≥ 0                                 −1 if z_i ≥ 0                −1 if z_i ≥ 0
      x_N        ≈ K_N (x_in cosh(z_in) + y_in sinh(z_in))     = x_in                       ≈ K_N (x_in cos(z_in) − y_in sin(z_in))
      y_N        ≈ K_N (x_in sinh(z_in) + y_in cosh(z_in))     = y_in + x_in z_in           ≈ K_N (x_in sin(z_in) − y_in cos(z_in))
      Range      |z_in| ≤ 1.1182                               |z_in| ≤ 1                   |z_in| ≤ 1.7433 (99.9°)

    Mode y → 0
      q_i        +1 if x_i y_i ≥ 0                             +1 if x_i y_i ≥ 0            +1 if y_i ≥ 0
                 −1 if x_i y_i < 0                             −1 if x_i y_i < 0            −1 if y_i < 0
      x_N        ≈ K_N sqrt(x_in² − y_in²)                     = x_in                       ≈ K_N sqrt(x_in² + y_in²)
      z_N        ≈ z_in + atanh(y_in / x_in)                   = z_in + y_in / x_in         ≈ z_in + atan2(y_in, x_in)
      Range      |atanh(y_in / x_in)| ≤ 1.1182                 |y_in / x_in| ≤ 1            |atan2(y_in, x_in)| ≤ 1.7433 (99.9°)

Table A.1: The six CORDIC modes.
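The convergence ranges quoted in Table A.1 can be checked as sums of the elementary angles of (A.4). The Python sketch below does this under two assumptions that are not stated in the table: the hyperbolic and linear sums start at i = 1 (atanh(2^0) is unbounded), and the repeated hyperbolic iterations 4, 13 and 40 are counted twice. With these choices the tabulated values are reproduced; the function names are illustrative.

```python
import math


def circular_range(n=48):
    """Sum of arctan(2^-i) for i = 0 .. n-1, in radians."""
    return sum(math.atan(2.0 ** -i) for i in range(n))


def hyperbolic_range(n=48, repeats=(4, 13, 40)):
    """Sum of atanh(2^-i) for i = 1 .. n-1, counting repeated iterations twice."""
    total = sum(math.atanh(2.0 ** -i) for i in range(1, n))
    total += sum(math.atanh(2.0 ** -i) for i in repeats if i < n)
    return total


def linear_range(n=48):
    """Sum of 2^-i for i = 1 .. n-1."""
    return sum(2.0 ** -i for i in range(1, n))


print(round(circular_range(), 4))                 # 1.7433
print(round(math.degrees(circular_range()), 1))   # 99.9
print(round(hyperbolic_range(), 4))               # 1.1182
print(round(linear_range(), 4))                   # 1.0
```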
Appendix B

Upper Bound of CORDIC Error

Kota and Cavallaro calculated the numerical accuracy of the CORDIC algorithm by examination of truncation and approximation errors.

They concluded that, for a CORDIC processor with all data paths being m bits wide and the number of iterations being n, the upper bound of error is:

$$E_u = 2^{-n} + 3.5 \cdot 2^{-m} \cdot n \tag{B.1}$$

This cannot be solved analytically, but a numerical solution (that is, for any given m) can be approximated graphically. The solution approximates to

$$\text{Internal Bus Width} = m \approx n + \log_2 n + 2 \tag{B.2}$$

Hence, to obtain a precision of n bits, a CORDIC processor with $(n + \log_2 n + 2)$ bits and n iterations would be sufficient. This solution represents the upper bound of error.
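Equations (B.1) and (B.2) are easy to evaluate for concrete word lengths. The Python sketch below does so; rounding the logarithm up to a whole number of bits is an assumption made here so that the bus width is an integer, and the function names are illustrative.

```python
import math


def error_bound(n, m):
    """Upper bound (B.1): E_u = 2^-n + 3.5 * 2^-m * n."""
    return 2.0 ** -n + 3.5 * 2.0 ** -m * n


def bus_width(n):
    """Approximation (B.2), with log2(n) rounded up to whole bits."""
    return n + math.ceil(math.log2(n)) + 2


for n in (8, 12, 16):
    m = bus_width(n)
    print(n, m, error_bound(n, m), error_bound(n, m) / 2.0 ** -n)
```

Substituting (B.2) into (B.1) gives 3.5 * 2^-m * n = 0.875 * 2^-n when log2(n) is an integer, so the bound evaluates to 1.875 * 2^-n; for n = 8 this means a 13 bit internal bus and a bound of about 0.0073.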
Bibliography

[1] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 330–334, 1959.
[2] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing," IEEE Signal Processing Magazine, pp. 16–35, July 1992.
[3] F. Koscsis and J. Bohme, "Fast algorithms and parallel structures for form factor evaluation," The Visual Computer, no. 8, pp. 205–216, 1992.
[4] M. Kameyama, T. Amada, and T. Higuchi, "Highly parallel collision detection processor for intelligent robots," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 500–506, 1992.
[5] M. O'Donnell et al., "Real-time phased array imaging using digital beam forming and autonomous channel control," Ultrasonics Symposium, pp. 1499–1502, 1990.
[6] G. Hampson and A. Papliński, "Beamforming by interpolation," Tech. Rep. 93-12, Monash University, 1993.
[7] G. L. Haviland and A. A. Tuszynski, "A CORDIC arithmetic processor chip," IEEE Transactions on Computers, vol. C-29, no. 2, pp. 68–79, 1980.
[8] J. Duprat and J.-M. Muller, "The CORDIC algorithm: New results for fast VLSI implementation," IEEE Transactions on Computers, vol. 42, pp. 168–178, February 1993.
[9] A. Papliński, "Array processor units for evaluating the exponential and logarithmic functions," Tech. Rep. TR-CS-82-07, The Australian National University, 1982.
[10] J. S. Walther, "A unified algorithm for elementary functions," Proceedings AFIPS Spring Joint Computer Conference, pp. 379–385, 1971.
[11] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors," IEEE Transactions on Computers, vol. 42, pp. 769–779, July 1993.
[12] Experts, "IEEE Std 1076-1987, IEEE Standard VHDL Language Reference Manual," IEEE Computer Society, February 1992.
[13] ViewLogic, VHDL Reference Manual for Synthesis, Powerview 5.1.3 release ed.