ECEN 4233 Implementation of Goldschmidt's algorithm for 16-bit division and square root
Cale Spratt and Jeremy Storm, Oklahoma State University
Abstract: Division and square root are common functions in digital systems such as microprocessors. Microprocessors that
have hardware to perform division and square root can do those functions in fewer clock cycles and with better power
efficiency. This implementation uses a 36-bit multiplication unit to implement the Goldschmidt algorithm with enough precision to
give accurate 16-bit results, and it also uses a two-stage pipeline to increase the efficiency of the implementation.
Index Terms: Goldschmidt's Algorithm, Carry Propagate Adder, Full Adder, Half Adder, Carry Save Array Multiplier
1 INTRODUCTION
In this implementation, the designed hardware can perform
these operations:
a) Division
b) Square Root
c) Multiplication
Division and square root are common mathematical
functions. Because the two functions are so important,
it is beneficial to create an implementation that runs
efficiently and produces accurate, reliable results
after each use. To decrease the number of clock cycles
required to perform a division or square root, the
design has been pipelined, allowing two different
divisions or square roots to be performed at the
same time.
Implementing Goldschmidt's algorithm requires a
multiplier, so the designed hardware can also perform
multiplication with the addition of a few control lines,
making the module more versatile. Most applications that
use a division method depend on the precision that the
hardware delivers. For our implementation we assumed a
lower precision: a sign bit, an integer bit, and 14
fractional bits, for values in [1,2) and (-2,-1]. Had our
design called for a larger range of operands, we would
have expanded the precision and modified the
approximation scheme to keep the number of iterations
effective. If a division is to occur, the approximation
muxes will select the value 0.75 or -0.75 according to the
sign; otherwise a square root is assumed and the value
will be 0.833. It is very important to select the correct
approximation to produce the proper division. The
value must always fall to the left of the quotient when
on an exponentially decreasing graph. Our module uses
the control muxes, A and B, to select the registers
or input values that propagate through the
multiplier. Specific output values have been mapped to
specific registers to implement the pipelining process.
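The number format described above (a sign bit, an integer bit, and 14 fractional bits) can be illustrated with a short sketch. The function names `encode` and `decode` are hypothetical and do not appear in the design; the sketch assumes a 16-bit two's complement word with rounding to the nearest representable value.

```python
FRAC_BITS = 14            # 14 fractional bits, as stated in the text
WORD_MASK = 0xFFFF        # 16-bit word: sign bit, integer bit, 14 fraction bits


def encode(x: float) -> int:
    """Convert a real value in [-2, 2) to a 16-bit two's complement word (assumed encoding)."""
    if not -2.0 <= x < 2.0:
        raise ValueError("value outside the representable range [-2, 2)")
    raw = round(x * (1 << FRAC_BITS))
    raw = max(-0x8000, min(0x7FFF, raw))   # clamp so rounding cannot wrap around
    return raw & WORD_MASK


def decode(word: int) -> float:
    """Convert a 16-bit two's complement word back to a real value."""
    if word & 0x8000:                      # sign bit set: negative value
        word -= 0x10000
    return word / (1 << FRAC_BITS)


if __name__ == "__main__":
    for value in (1.0, 0.75, -0.75, 1.5):
        print(f"{value:+.4f} -> 0x{encode(value):04X} -> {decode(encode(value)):+.4f}")
```

Under this assumed encoding the division approximation 0.75 is the word 0x3000 and the value 1.0 is 0x4000.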
2 DIVISION
2.1 Goldschmidt Method
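The Goldschmidt method calculates the quotient Q =
N/D by repeatedly multiplying the numerator and the
denominator by the same correction factor, so that the
denominator converges to 1 and the numerator converges
to Q. It is an iterative method that gives a better
approximation after each iteration. The number of
correct bits roughly doubles with each successive
iteration. In the notation of the state tables, the
equation for determining the quotient is
$$Q = \frac{N \, K_0 K_1 \cdots K_i}{D \, K_0 K_1 \cdots K_i}, \qquad K_{i+1} = 2 - D \, K_0 K_1 \cdots K_i$$
K0 is the initial approximation that is provided from
the mux unit, dependent upon the mathematical operation,
and K1 is obtained from it.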
The number of iterations needed is equal to (n/log_2(radix)), where n is the number of bits of precision.
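The iteration can be sketched in floating point as follows. This is a minimal illustration and not the hardware implementation: the function name `goldschmidt_divide` is hypothetical, the initial approximation 0.75 is the division constant quoted in the introduction, and the default of five iterations follows the error analysis later in the paper.

```python
def goldschmidt_divide(n: float, d: float, ia: float = 0.75, iterations: int = 5) -> float:
    """Approximate n / d for d in [1, 2) using Goldschmidt's iteration (floating-point sketch)."""
    q = n * ia            # numerator scaled by the initial approximation
    r = d * ia            # denominator scaled the same way; should converge to 1
    for _ in range(iterations):
        k = 2.0 - r       # correction factor for this iteration
        q *= k            # numerator converges to the quotient
        r *= k            # denominator converges to 1
    return q


if __name__ == "__main__":
    n, d = 1.923, 1.523
    print(goldschmidt_divide(n, d), n / d)
```

Because the denominator error is squared on each pass, a coarse starting value such as 0.75 is enough for operands in [1,2).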
2 HIGH SPEED PROJECT SPRAT, STORM
3 Hardware Modules
A. Initial Approximation
The initial approximation (IA) module consists of three
16-bit mux units that cascade a combination of signed
approximations based on the mathematical function
requested. The input values are in [1,2) or (-2,-1].
There are only three constant approximations,
0.75, -0.75, and 0.833; the first two correspond to
the signed divisions, and 0.833 is the approximation
used for a square root operation. A 2-bit
selection input is provided to the cascaded muxes to
select the approximation for the requested operation.
Area is 96 gates. Delay is 6.
B. Multiplexers
We have a total of three multiplexer sets that help
implement the required mathematical operation
for each iteration of the pipeline stages. Multiplexer
A selects among the 16-bit values from
the initial approximation unit, the inputs N and D, and the
two register units C and D. A 3-bit mux selection input
determines the 16-bit mux output value that is
propagated to our multiplier. All values passed to the
multiplier through the muxes are extended with two
additional bits for higher precision during multiplication. The
table below lists the selection values for
multiplexer A. Area is 288 gates. Delay is 9.
Mux Select (3 bits)   Output (16 bits)
000                   ---
001                   Regc
010                   ---
011                   D
100                   Regd
101                   IA
110                   ---
111                   N
Multiplexer B selects a 16-bit output value from the
initial approximation, register A, and register B. This
mux provides the second operand for our
CSAM multiplier. The logic table for multiplexer B is
listed below. Area is 96 gates. Delay is 6.
Mux Select (2 bits)   Output (16 bits)
00                    Rega
01                    Regb
10                    Rega
11                    IA
Multiplexer Twos propagates either a 16-bit two or
three to the 2's complement module in order to
perform either a division or a square root.
The table for the logic is provided below. Area is 96
gates. Delay is 6.
Mux Select (1 bit)    Output (16 bits)
0                     16'h4000
1                     16'h6000
The multiplexer for the 2's complement logic unit
has an area of 48 gates and a delay of 3, based on the
layout of the unit.
C. Signed Carry Save Array Multiplier (CSAM)
The Carry Save Array Multiplier (CSAM) is a pipelined
module, which gives the external register units the
ability to store the appropriate multiplication value. Our
multiplier was instantiated with a 36-bit product and two
18-bit inputs from the pre-module muxes. This multiplier
gives one integer bit and one sign bit, with 16 bits of precision.
Three 18-bit register modules that incorporate latch
control were placed between the multiplication array
and the Carry Propagate Adders (CPA). This is required
to allow the expected pipelining capability. These
registers store the carry and sum values for propagation
into the CPAs. The implementation was developed to
use 16-bit external registers, so all the inputs and
outputs have to be extended or truncated to meet the
processing requirements. The two most-significant bits
are always removed after the CPA sum is calculated.
In other words, we have performed the same
function as a rounding unit would. Area is 2433 gates.
Delay is 37.
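The effect of multiplying two fixed-point words and cutting the wide product back to 16 bits can be sketched as follows. This is an illustration under assumptions, not the gate-level design: the function names are hypothetical, the operands are taken to be 16-bit two's complement words with 14 fractional bits, the two extra multiplier input bits are assumed to be sign extension (which does not change the value), and the result is truncated rather than rounded.

```python
FRAC_BITS = 14


def to_signed(word: int, bits: int) -> int:
    """Interpret an unsigned bit pattern of the given width as two's complement."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word


def csam_multiply(a16: int, b16: int) -> int:
    """Multiply two 16-bit fixed-point words and return a 16-bit fixed-point word."""
    a = to_signed(a16, 16)                  # sign-extending to 18 bits keeps the same value
    b = to_signed(b16, 16)
    product = (a * b) & ((1 << 36) - 1)     # 36-bit two's complement product, 28 fraction bits
    return (product >> FRAC_BITS) & 0xFFFF  # drop 14 low bits and the extra high bits


if __name__ == "__main__":
    # 1.5 * 0.75 = 1.125, which is 0x4800 with 14 fractional bits
    print(hex(csam_multiply(0x6000, 0x3000)))
```

Truncating the low bits introduces an error of less than one unit in the last place per multiplication, which is why the extra multiplier precision matters over several iterations.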
D. 2's Complement Module
The complement subtractor has to be integrated
to perform both the division and square root functions. An
input mux determines the constant designated
for the intended mathematical operation. For the
division operation, the 2's complement logic uses a
16-bit value of 2, supplied as
an output from the 2's selection mux. A 16-bit value of
3 is used for a square root operation. Our
2's complement module has 16-bit inputs and outputs,
but they have to be modified to include one additional
integer bit for the multiplier value being used. We
concatenate one bit after the left-most bit of the
multiplier value and one bit to the right-most bit of the subtractor
value. After the subtraction is complete, the
additional integer bit is removed before propagating the
output. Area is 144 gates for 16 bits. Delay is 55.
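One possible reading of the bit manipulation described above is sketched below. It is an interpretation, not the authors' netlist: the constants 16'h4000 and 16'h6000 come from the Multiplexer Twos table, the bit appended on the right of the constant is assumed to be zero (so that 0x4000 becomes 2.0 and 0x6000 becomes 3.0 with 14 fractional bits), the bit added on the left of the multiplier value is assumed to be sign extension, and the function name is hypothetical.

```python
TWO_CONST = 0x4000     # constant selected for division (from the mux table)
THREE_CONST = 0x6000   # constant selected for square root (from the mux table)
MASK17 = 0x1FFFF       # 17-bit working width inside the subtractor (assumption)


def complement_subtract(const16: int, value16: int) -> int:
    """Compute constant - value using two's complement addition on 17 bits."""
    minuend = const16 << 1                               # append a zero bit on the right
    value = value16 - 0x10000 if value16 & 0x8000 else value16
    negated = (~value + 1) & MASK17                      # two's complement of the sign-extended value
    diff = (minuend + negated) & MASK17
    return diff & 0xFFFF                                 # remove the additional integer bit


if __name__ == "__main__":
    # 2 - 0.75 = 1.25, which is 0x5000 with 14 fractional bits
    print(hex(complement_subtract(TWO_CONST, 0x3000)))
```

For a square root the 16-bit result of `complement_subtract(THREE_CONST, r)` would then be shifted right by one bit to form (3 - r)/2.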
E. Total Area and Delay
The total area for the entire block is the sum of the areas of the
initial approximation unit, the two input muxes, the
square root/division mux, the multiplier, and the 2's
complement logic unit. This area comes out to be 3105 gates.
The delay follows the same process and comes out
to be roughly 122.
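The area total can be checked by adding the per-module figures quoted in this section. The dictionary keys below are labels chosen for illustration, and the sketch assumes that the 48-gate figure is the one used for the square root/division mux, since that is the value that reproduces the stated total.

```python
# Gate counts as quoted in Section 3 (labels are illustrative, not module names from the design)
MODULE_AREAS = {
    "initial approximation": 96,
    "multiplexer A": 288,
    "multiplexer B": 96,
    "square root/division mux": 48,   # assumption: the 48-gate figure is the one summed
    "CSAM multiplier": 2433,
    "2's complement logic": 144,
}

total_area = sum(MODULE_AREAS.values())

if __name__ == "__main__":
    print(f"total area = {total_area} gates")
```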
3. State Table
3.1 State Table for Division
CLK              0    1     2       3       4        5
mux_selecta      111  111   110     null    100      011
mux_selectb      11   11    11      null    00       00
mux_twos_select  1    1     1       1       1        1
rega_out         0    0     2-K*IA  2-K*IA  2-D*IA   2-D*K0*K1
regb_out         0    0     0       0       0        0
regc_out         0    N*IA  N*IA    N*IA    N*K0*K1  N*K0*K1
regd_out         0    0     D*IA    D*IA    D*IA     D*K0*K1
rega_load        0    0     1       null    0        1
regb_load        0    0     0       null    0        0
regc_load        0    1     0       null    1        0
regd_load        0    0     1       null    0        1
CLK              6          7           8             9             10             11
mux_selecta      null       100         011           null          100            011
mux_selectb      null       00          00            null          00             00
mux_twos_select  1          1           1             1             1              1
rega_out         2-D*K0*K1  2-D*K0*K1   2-D*K0*K1*K2  2-D*K0*K1*K2  2-D*K0*K1*K2   2-D*K0*K1*K2*K3
regb_out         0          0           0             0             0              0
regc_out         N*K0*K1    N*K0*K1*K2  N*K0*K1*K2    N*K0*K1*K2    N*K0*K1*K2*K3  N*K0*K1*K2*K3
regd_out         D*K0*K1    D*K0*K1     D*K0*K1*K2    D*K0*K1*K2    D*K0*K1*K2     D*K0*K1*K2*K3
rega_load        null       0           1             null          0              1
regb_load        null       0           0             null          0              0
regc_load        null       1           0             null          1              0
regd_load        null       0           1             null          0              1
CLK              12               13                14
mux_selecta      null             100               011
mux_selectb      null             00                00
mux_twos_select  1                1                 1
rega_out         2-D*K0*K1*K2*K3  2-D*K0*K1*K2*K3   2-D*K0*K1*K2*K3*K4
regb_out         0                0                 0
regc_out         N*K0*K1*K2*K3    N*K0*K1*K2*K3*K4  N*K0*K1*K2*K3*K4
regd_out         D*K0*K1*K2*K3    D*K0*K1*K2*K3     D*K0*K1*K2*K3*K4
rega_load        null             0                 1
regb_load        null             0                 0
regc_load        null             1                 0
regd_load        null             0                 1
3.2 State Table for Square Root
After every three iterations, the 16-bit output from the 2's complement module is shifted by one bit before
propagating to the mux, which provides the division by two needed for a square root.
CLK              0    1     2       3       4        5
mux_selecta      111  111   110     null    100      011
mux_selectb      11   11    11      null    00       00
mux_twos_select  0    0     0       0       0        0
rega_out         0    0     3-K*IA  3-K*IA  3-D*IA   3-D*K0*K1
regb_out         0    0     0       0       0        0
regc_out         0    N*IA  N*IA    N*IA    N*K0*K1  N*K0*K1
regd_out         0    0     D*IA    D*IA    D*IA     D*K0*K1
rega_load        0    0     1       null    0        1
regb_load        0    0     0       null    0        0
regc_load        0    1     0       null    1        0
regd_load        0    0     1       null    0        1
CLK              6          7           8             9             10             11
mux_selecta      null       100         011           null          100            011
mux_selectb      null       00          00            null          00             00
mux_twos_select  0          0           0             0             0              0
rega_out         3-D*K0*K1  3-D*K0*K1   3-D*K0*K1*K2  3-D*K0*K1*K2  3-D*K0*K1*K2   3-D*K0*K1*K2*K3
regb_out         0          0           0             0             0              0
regc_out         N*K0*K1    N*K0*K1*K2  N*K0*K1*K2    N*K0*K1*K2    N*K0*K1*K2*K3  N*K0*K1*K2*K3
regd_out         D*K0*K1    D*K0*K1     D*K0*K1*K2    D*K0*K1*K2    D*K0*K1*K2     D*K0*K1*K2*K3
rega_load        null       0           1             null          0              1
regb_load        null       0           0             null          0              0
regc_load        null       1           0             null          1              0
regd_load        null       0           1             null          0              1
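CLK              12               13                14
mux_selecta      null             100               011
mux_selectb      null             00                00
mux_twos_select  0                0                 0
rega_out         3-D*K0*K1*K2*K3  3-D*K0*K1*K2*K3   3-D*K0*K1*K2*K3*K4
regb_out         0                0                 0
regc_out         N*K0*K1*K2*K3    N*K0*K1*K2*K3*K4  N*K0*K1*K2*K3*K4
regd_out         D*K0*K1*K2*K3    D*K0*K1*K2*K3     D*K0*K1*K2*K3*K4
rega_load        null             0                 1
regb_load        null             0                 0
regc_load        null             1                 0
regd_load        null             0                 1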
1 ERROR ANALYSIS
1.1 Error Analysis for Division
The analysis for division and square root was performed using the Excel spreadsheets provided in the course.
1. N = 1.923, D = 1.523, IA = 0.75, N/D = 1.262639527
q*K          r*K          2-D*Xi       TRUE         Error        #bits
1.44225      1.14225      0.85775      1.262639527  0.179610473  -2.477056622
1.237089938  0.979764938  1.020235063  1.262639527  0.02554959   -5.290556064
1.26212253   0.999590542  1.000409458  1.262639527  0.000516998  -10.91755495
1.262639316  0.999999832  1.000000168  1.262639527  2.11689E-07  -22.17155272
1.262639527  1            1            1.262639527  3.59712E-14  -44.66015
1.262639527  1            1            1.262639527  6.66134E-16  -50.4150375

2. N = 1.923, D = 1.013, IA = 0.75, N/D = 1.898321816
q*K          r*K          2-D*Xi       TRUE         Error        #bits
1.44225      0.75975      1.24025      1.898321816  0.456071816  -1.132667075
1.788750563  0.942279938  1.057720063  1.898321816  0.109571254  -3.190058739
1.891997357  0.996668394  1.003331606  1.898321816  0.00632446   -7.304842067
1.898300746  0.9999889    1.0000111    1.898321816  2.10706E-05  -15.53440872
1.898321816  1            1            1.898321816  2.33875E-10  -31.99354105
1.898321816  1            1            1.898321816  2.22045E-16  -52

3. N = -1.012, D = 1.9123, IA = 0.75, N/D = -0.529205669
q*K           r*K          2-D*Xi       TRUE          Error        #bits
-0.759        1.434225     0.565775     -0.529205669  0.229794331  -2.121584885
-0.429423225  0.811448649  1.188551351  -0.529205669  0.099782444  -3.32507019
-0.510391554  0.964448388  1.035551612  -0.529205669  0.018814115  -5.7320408
-0.528536796  0.998736083  1.001263917  -0.529205669  0.000668872  -10.54598202
-0.529204823  0.999998403  1.000001597  -0.529205669  8.45399E-07  -20.17386446
-0.529205669  1            1            -0.529205669  1.35048E-12  -39.4296699

4. N = 1.012, D = 1.9123, IA = 0.75, N/D = 0.529205669
q*K          r*K          2-D*Xi       TRUE         Error        #bits
0.759        1.434225     0.565775     0.529205669  0.229794331  -2.121584885
0.429423225  0.811448649  1.188551351  0.529205669  0.099782444  -3.32507019
0.510391554  0.964448388  1.035551612  0.529205669  0.018814115  -5.7320408
0.528536796  0.998736083  1.001263917  0.529205669  0.000668872  -10.54598202
0.529204823  0.999998403  1.000001597  0.529205669  8.45399E-07  -20.17386446
0.529205669  1            1            0.529205669  1.35048E-12  -39.4296699
In all the above cases, the error after five iterations was smaller than 2^-16, that is, more than 16 bits of
accuracy, showing that the design should be accurate to the number of bits used in the design.
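The #bits column of the division tables is the base-2 logarithm of the absolute error. A minimal floating-point sketch that recomputes it is shown below; the function name `division_error_bits` is hypothetical, and the sketch uses double-precision arithmetic like a spreadsheet rather than the 16-bit hardware format, so it does not model truncation in the multiplier.

```python
import math


def division_error_bits(n: float, d: float, ia: float = 0.75, rows: int = 6) -> list[float]:
    """Return log2(|n/d - q|) for each row of the spreadsheet-style analysis."""
    true_q = n / d
    q, r = n * ia, d * ia                 # first row: operands times the initial approximation
    bits = []
    for _ in range(rows):
        err = abs(true_q - q)
        bits.append(math.log2(err) if err > 0 else float("-inf"))
        k = 2.0 - r                       # Goldschmidt correction factor
        q, r = q * k, r * k
    return bits


if __name__ == "__main__":
    for row, b in enumerate(division_error_bits(1.923, 1.523), start=1):
        print(f"row {row}: {b:.3f} bits")
```

A value below -16 in a row means the approximation in that row agrees with the true quotient to more than 16 bits.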
1.2 Error Analysis for Square Root
1. N = 1.231, D = 1.231, IA = 0.853553391, N/D = 1
q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.050724224  0.896849224  1.051575388  1.109504394  0.05878017   -4.088526657
1.104915733  0.991745555  1.004127223  1.109504394  0.00458866   -7.767711238
1.109475967  0.999948757  1.000025621  1.109504394  2.84273E-05  -15.10236562
1.109504393  0.999999998  1.000000001  1.109504394  1.09252E-09  -29.76969629
1.109504394  1            1            1.109504394  2.22045E-16  -52
1.109504394  1            1            1.109504394  2.22045E-16  -52

2. N = 1.99, D = 1.99, IA = 0.853553391, N/D = 1
q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.698571247  1.449821247  0.775089376  1.410673598  0.287897649  -1.796372086
1.316544529  0.870999747  1.064500127  1.410673598  0.094129069  -3.409215861
1.401461818  0.986982526  1.006508737  1.410673598  0.00921178   -6.762304257
1.410583564  0.999872358  1.000063821  1.410673598  9.00338E-05  -13.43917395
1.410673589  0.999999988  1.000000006  1.410673598  8.61919E-09  -26.78980037
1.410673598  1            1            1.410673598  2.22045E-16  -52
3. N = 1.01, D = 1.01, IA = 0.853553391, N/D = 1
q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
0.862088924  0.735838924  1.132080538  1.004987562  0.142898638  -2.806935933
0.975954093  0.943055834  1.028472083  1.004987562  0.029033469  -5.106139236
1.003741539  0.997521859  1.00123907   1.004987562  0.001246023  -9.6484538
1.004985246  0.99999539   1.000002305  1.004987562  2.31634E-06  -18.7197191
1.004987562  1            1            1.004987562  8.00848E-12  -36.86160819
1.004987562  1            1            1.004987562  2.22045E-16  -52

4. N = 1.7532, D = 1.7532, IA = 0.853553391, N/D = 1
q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.496449804  1.277299804  0.861350098  1.324084589  0.172365215  -2.536459442
1.288967185  0.947659369  1.026170316  1.324084589  0.035117404  -4.831669987
1.322699864  0.997909496  1.001045252  1.324084589  0.001384726  -9.496183876
1.324082418  0.99999672   1.00000164   1.324084589  2.17146E-06  -18.81290355
1.324084589  1            1            1.324084589  5.34195E-12  -37.44577091
1.324084589  1            1            1.324084589  2.22045E-16  -52

After five iterations the number of bits of error exceeds
16 bits in all of the simulations, showing that five
iterations is sufficient to produce enough accuracy.
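The square root tables follow the recurrence implied by their column headings: the estimate is multiplied by K = (3 - r)/2 and the residual r by K squared. A floating-point sketch under those assumptions is given below; the function name `goldschmidt_sqrt` is hypothetical, and the initial approximation 0.853553391 is the IA value listed in the tables.

```python
import math


def goldschmidt_sqrt(n: float, ia: float = 0.853553391, iterations: int = 5) -> float:
    """Approximate sqrt(n) for n in [1, 2) with the (3 - r)/2 Goldschmidt iteration."""
    q = n * ia            # estimate of sqrt(n)
    r = n * ia * ia       # residual; should converge to 1
    for _ in range(iterations):
        k = (3.0 - r) / 2.0   # 3 - r followed by a one-bit shift in hardware
        q *= k
        r *= k * k
    return q


if __name__ == "__main__":
    for n in (1.231, 1.99, 1.01, 1.7532):
        approx = goldschmidt_sqrt(n)
        print(f"sqrt({n}) ~ {approx:.9f}, error = {abs(approx - math.sqrt(n)):.3e}")
```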
4 APPENDICES
4.1 Specialized Full Adder
4.2 Carry Save Array Multiplier
4.3 Subtractor
4.4 Overall Project Design
4.5 MUXa
4. CONCLUSION
The designed hardware can be used in a digital
system that requires division and square
root functions with accuracy within 16 bits.
This module can also be implemented in a small
area, especially when a multiplication module is
already present, since the division module can be built
upon it. The module also gives a
drastic speedup for division and square
root functions over programming the functions
manually in a system. To further speed
up processing, the design has been
pipelined, increasing its possible throughput.
The error analysis implies that we should retain 39
bits for a division operation and 52 fractional bits for
square rooting. The state tables provide the process
flow with the corresponding registers, the mux select
values, and the mathematical operations occurring after
each iteration. The multiplier registers are internal to
the module to help provide the pipeline capability for
the unit. We stall the module after each
iteration to ensure that multiplications do not
overrun each other.
REFERENCES
[1] High Speed Computer Arithmetic Class, Dr. Stine, Spring 2013.
[2] M. D. Ercegovac and J. Muller, "Design of a Complex Divider," Computer
Science Department, University of California, Los Angeles, California,
U.S.A.
[3] Aswin Ramachandran, "ECEN 5060 Final Project: Implementation of
Goldschmidt Algorithm for Division, Square Root and Inverse Square
Root," Graduate Student, Oklahoma State University, 2006.
[4] Javier Hormigo, Julio Villalba, and Emilio L. Zapata, "CORDIC Algorithm
with Digits Skipping," Dept. Computer Architecture, University of Malaga
(Spain).