Implementation of Goldschmidt’s algorithm for 16-bit division and square root


ECEN 4233 Implementation of Goldschmidt’s algorithm for 16-bit division and square root

Cale Spratt and Jeremy Storm, Oklahoma State University

Abstract: Division and square root are common functions used in digital systems such as microprocessors. Microprocessors that have dedicated hardware for division and square root can perform those functions in fewer clock cycles and with better power efficiency. This implementation uses a 36-bit multiplication unit to implement the Goldschmidt algorithm with enough precision to give accurate results, and it uses a two-stage pipeline to increase the efficiency of the implementation.

Index Terms: Goldschmidt’s Algorithm, Carry Propagate Adder, Full Adder, Half Adder, Carry Save Array Multiplier

1 INTRODUCTION

In this implementation, the designed hardware can perform these operations:

a) Division

b) Square Root

c) Multiplication

Division and square root are common mathematical functions. Because both functions are so widely used, it is beneficial to create an implementation that runs efficiently and produces accurate, reliable results on every use. To decrease the number of clock cycles required to perform division or square root, the design has been pipelined, allowing two different divisions or square roots to be in progress at the same time.

Implementing Goldschmidt’s algorithm requires a multiplier, so with the addition of a few control lines the designed hardware can also perform multiplication, making the module even more functional. Most applications that use any division method depend on the precision of the values that the hardware outputs. For our implementation we assumed lower precision: a sign bit, an integer bit, and 14 bits of precision, for values falling in [1,2) and (-2,-1]. If the design had called for a larger range of operands, we would have expanded the precision and implemented a modified approximation scheme to keep the number of iterations effective. If a division is to occur, the approximation muxes will select the value 0.75 or -0.75 with the corresponding sign; otherwise a square root is assumed and the value will be 0.833. It is very important to select the correct approximation to produce the proper result. The value must always fall to the left of the quotient on an exponentially decreasing graph. Our module uses the control muxes, A and B, to select the registers or input values that are propagated through the multiplier. Specific output values have been mapped to certain registers to create a specific pipelining process.
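The number format described above (a sign bit, one integer bit, and 14 fractional bits) can be sketched in Python as follows. This is a minimal illustration that assumes two's-complement encoding, which the text does not state explicitly; the helper names `to_fixed` and `from_fixed` are hypothetical and do not come from the design.

```python
import math

FRAC_BITS = 14                 # fractional bits, per the format in the text
WORD_BITS = 16                 # sign bit + integer bit + 14 fractional bits
MASK = (1 << WORD_BITS) - 1

def to_fixed(x: float) -> int:
    """Encode x in [-2, 2) as a 16-bit word (two's complement is an assumption)."""
    if not -2.0 <= x < 2.0:
        raise ValueError("value outside the representable range [-2, 2)")
    # Truncate toward negative infinity so the result always fits in 16 bits.
    return math.floor(x * (1 << FRAC_BITS)) & MASK

def from_fixed(word: int) -> float:
    """Decode a 16-bit word back to a Python float."""
    word &= MASK
    if word & (1 << (WORD_BITS - 1)):   # sign bit set: negative value
        word -= 1 << WORD_BITS
    return word / (1 << FRAC_BITS)
```

Under this assumed encoding the initial approximations 0.75 and -0.75 map to the words 0x3000 and 0xD000, and the smallest step between representable values is 2^-14.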

2 DIVISION

2.1 Goldschmidt Method

The Goldschmidt method is an iterative, multiplication-based method for computing the quotient Q = N/D. The numerator and denominator are both multiplied by a sequence of factors K0, K1, K2, ... chosen so that the denominator converges to 1, which makes the numerator converge to Q. Each iteration gives a better approximation than the last, and convergence is quadratic: the number of accurate bits roughly doubles with each successive iteration. The equation for determining the quotient is

Q = N/D = (N*K0*K1*...*Ki) / (D*K0*K1*...*Ki), with K(i+1) = 2 - D*K0*K1*...*Ki

K1 is obtained from the initial approximation (IA) that the mux unit provides for the requested mathematical operation.

The number of iterations needed is equal to n/log_2(radix), where n is the number of bits of precision.
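The iteration can be sketched in floating-point Python as below. This is an illustration of the arithmetic only, not of the 16-bit hardware: it assumes that the first factor K0 is the initial approximation IA, and the function name and the `iterations` parameter are hypothetical. The reason the method converges quickly is that if the scaled denominator is r = 1 - e, then K = 2 - r = 1 + e and r*K = 1 - e^2, so the error is squared on every step.

```python
def goldschmidt_divide(n: float, d: float, ia: float = 0.75,
                       iterations: int = 5) -> float:
    """Approximate n / d; `iterations` counts every multiplication by a
    factor K, including the first one by the initial approximation."""
    q = n * ia            # numerator scaled by K0 = IA (assumption)
    r = d * ia            # denominator scaled the same way; should approach 1
    for _ in range(iterations - 1):
        k = 2.0 - r       # next factor, K = 2 - D*K0*...*Ki
        q *= k            # numerator converges to the quotient
        r *= k            # denominator converges to 1
    return q
```

For example, `goldschmidt_divide(1.923, 1.523)` agrees with 1.923/1.523 to well within 2^-16 after five iterations.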

2 HIGH SPEED PROJECT SPRAT, STORM

3 Hardware Modules

A. Initial Approximation

The initial approximation (IA) module consists of three 16-bit mux units, cascaded to select a signed approximation based on the mathematical function requested. The input values lie in [1,2) or (-2,-1]. There are only three constant approximations, 0.75, -0.75, and 0.833, where the first two correspond to divisions of the specified sign. The value 0.833 is the approximation used for a square root operation. A 2-bit selection input is provided to the cascaded muxes to select the approximation for the operation being calculated. Area is 96 gates. Delay is 6.
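A short check illustrates why a constant of magnitude 0.75 is a workable starting point for division over the stated input range. The division factor K = 2 - r squares the error 1 - r on every step, so the iteration converges only when the first scaled denominator r = D*IA satisfies |1 - r| < 1. The sketch below assumes that the sign of IA follows the sign of the divisor; the text does not say which operand selects the sign, and the function name is hypothetical.

```python
def initial_residual(d: float) -> float:
    """Return D * IA, the starting denominator term of the division iteration."""
    ia = 0.75 if d > 0 else -0.75   # assumption: IA takes the sign of the divisor
    return d * ia

# Sample divisors across [1, 2) and (-2, -1].
samples = [1.0 + i / 1000.0 for i in range(1000)]
samples += [-d for d in samples]

# Largest starting error |1 - D*IA| over the sampled range.
worst = max(abs(1.0 - initial_residual(d)) for d in samples)
```

Over this range the starting error stays below 0.5, so the first iteration already reduces it to below 0.25.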

B. Multiplexers

We have a total of three multiplexor sets that assist in implementing the expected mathematical operation for each iteration of the pipeline stages. Multiplexor A selects among the 16-bit values from the initial approximation unit, the inputs N and D, and the two register units C and D. A 3-bit mux selection input determines the 16-bit mux output value that is propagated to our multiplier. All values passed to the multiplier through the muxes are extended with two additional bits for higher precision during multiplication. The table below lists the output for each selection value of multiplexor A. Area is 288 gates. Delay is 9.

Mux Select (3 bits)   Output (16 bits)
000                   ---
001                   Regc
010                   ---
011                   D
100                   Regd
101                   IA
110                   ---
111                   N

Multiplexor B selects a 16-bit output value from the initial approximation, register A, and register B. This mux provides the second operand for our CSAM multiplier. The logic table for multiplexor B is listed below. Area is 96 gates. Delay is 6.

Mux Select (2 bits)   Output (16 bits)
00                    Rega
01                    Regb
10                    Rega
11                    IA
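The two selection tables above can be modeled as simple lookups. The sketch below is a behavioral illustration only: the dictionary and function names are hypothetical, and select codes that the tables list without an output are treated as returning no value, which is an assumption about the unused inputs.

```python
# Select code -> source name, copied from the tables for multiplexors A and B.
MUX_A = {0b001: "Regc", 0b011: "D", 0b100: "Regd", 0b101: "IA", 0b111: "N"}
MUX_B = {0b00: "Rega", 0b01: "Regb", 0b10: "Rega", 0b11: "IA"}

def mux_select(table: dict, select: int, sources: dict):
    """Return the 16-bit word routed to the multiplier for a select code,
    or None when the table lists no source for that code."""
    name = table.get(select)
    return None if name is None else sources[name]

# Example source values (arbitrary 16-bit words, for illustration only).
sources = {"N": 0x7B12, "D": 0x6179, "IA": 0x3000,
           "Rega": 0x0000, "Regb": 0x0000, "Regc": 0x0000, "Regd": 0x0000}
operand_a = mux_select(MUX_A, 0b111, sources)   # selects N
operand_b = mux_select(MUX_B, 0b11, sources)    # selects IA
```

With select values 111 and 11 the multiplier receives N and IA, which matches the first multiplication in the state tables.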

Multiplexor Twos propagates either a 16-bit two or a 16-bit three to the 2s complement module in order to perform either a division or a square root, respectively. The logic table is provided below. Area is 96 gates. Delay is 6.

Mux Select (1 bit)    Output (16 bits)
0                     16'h4000
1                     16'h6000


The multiplexor for the 2s complement logic unit has an area of 48 gates and a delay of 3, based on the layout of the unit.

C. Signed Carry Save Array Multiplier (CSAM)

The Carry Save Array Multiplier (CSAM) is a pipelined module, which gives the external register units the ability to store the appropriate multiplication value. Our multiplier was instantiated with a 36-bit product and two 18-bit inputs from the pre-module muxes. This multiplier uses one integer bit and one sign bit, with 16 bits of precision. Three 18-bit register modules that incorporate latch control were placed between the multiplication array and the Carry Propagate Adders (CPA). This is required to allow the expected pipelining capability. These registers store the carry and sum values for propagation into the CPAs. The implementation was developed to use 16-bit external registers, so all the inputs and outputs have to be extended or spliced to meet the processing requirements. The most-significant two bits are always removed after the CPA sum has been calculated. In other words, we perform the same function as a rounding unit would. Area is 2433 gates. Delay is 37.
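The word widths described above can be illustrated with a behavioral sketch of the multiply path. It models only the arithmetic, not the carry-save array or its pipeline registers, and it rests on two assumptions that the text does not spell out: the two bits added to each 16-bit operand are zero-valued fraction bits, and the 16-bit result is taken from the product after dropping the two most-significant bits and the 18 least-significant bits. The function names are hypothetical.

```python
def sign_extend(word: int, bits: int) -> int:
    """Interpret the low `bits` bits of word as a two's-complement integer."""
    word &= (1 << bits) - 1
    if word & (1 << (bits - 1)):
        word -= 1 << bits
    return word

def multiply_16(a16: int, b16: int) -> int:
    """Multiply two 16-bit words (sign, integer bit, 14 fraction bits)."""
    a18 = sign_extend(a16, 16) << 2   # 18-bit operand, 16 fraction bits (assumed)
    b18 = sign_extend(b16, 16) << 2
    product = a18 * b18               # fits in 36 bits, 32 fraction bits
    # Drop 18 low bits to return to 14 fraction bits, then keep 16 bits,
    # which discards the two most-significant bits of the 36-bit product.
    return (product >> 18) & 0xFFFF
```

For example, 0x3000 (0.75) times 0x6000 (1.5) gives 0x4800, which is 1.125 in the same format.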

D. 2s Complement Module

The complement subtractor has to be integrated to perform both the division and square root functions. An input mux determines the constant designated for the intended mathematical operation. For the division operation, the 2s complement logic uses a 16-bit value of 2, supplied as an output from the 2s selection mux. A 16-bit value of 3 is used for a square root operation. Our 2s complement module has 16-bit inputs and outputs, but they have to be modified to include one additional integer bit for the multiplier value being used. We concatenate one bit after the left-most bit of the multiplier value and one bit to the right-most bit of the subtractor value. After the subtraction is complete, the additional integer bit is removed before the output is propagated. Area is 144 gates for 16 bits. Delay is 55.
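One reading of the bit handling described above is sketched below. It assumes that the bit added to the multiplier value is a sign-extension bit on the left, and that the bit added to the constant is a zero on the right, so that the mux outputs 16'h4000 and 16'h6000 become 2.0 and 3.0 in a 17-bit format with two integer bits. Both assumptions, and the function names, are illustrative rather than taken from the design.

```python
def sign_extend(word: int, bits: int) -> int:
    """Interpret the low `bits` bits of word as a two's-complement integer."""
    word &= (1 << bits) - 1
    if word & (1 << (bits - 1)):
        word -= 1 << bits
    return word

def complement_subtract(x16: int, const16: int) -> int:
    """Return (constant - x) as a 16-bit word with 14 fraction bits."""
    x17 = sign_extend(x16, 16)        # extra bit on the left: value unchanged
    c17 = (const16 & 0xFFFF) << 1     # extra bit on the right: 0x4000 -> 2.0
    return (c17 - x17) & 0xFFFF       # subtract, then drop the extra integer bit
```

With the division constant, an input of 0x3000 (0.75) yields 0x5000 (1.25), that is 2 - 0.75.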

E. Total Area and Delay

The total area for the entire block is the sum of the areas of the initial approximation unit, the two input muxes, the square root/division mux, the multiplier, and the 2s complement logic unit. This area comes out to be 3105 gates. The delay follows the same process and comes out to be roughly 122.
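The area total can be checked by adding the per-module figures quoted in this section. The sketch below uses the 48-gate figure given for the multiplexor of the 2s complement logic unit, since that is the value that reproduces the stated total; the dictionary keys are descriptive labels, not module names from the design.

```python
# Gate counts as quoted in the module descriptions above.
AREAS = {
    "initial approximation unit": 96,
    "multiplexor A": 288,
    "multiplexor B": 96,
    "square root/division mux": 48,
    "CSAM multiplier": 2433,
    "2s complement logic unit": 144,
}

total_area = sum(AREAS.values())   # total gate count for the whole block
```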


3. State Table

3.1 State Table for Division

CLK              0          1          2          3          4          5
mux_selecta      111        111        110        null       100        011
mux_selectb      11         11         11         null       00         00
mux_twos_select  1          1          1          1          1          1
rega_out         0          0          2-K*IA     2-K*IA     2-D*IA     2-D*K0*K1
regb_out         0          0          0          0          0          0
regc_out         0          N*IA       N*IA       N*IA       N*K0*K1    N*K0*K1
regd_out         0          0          D*IA       D*IA       D*IA       D*K0*K1
rega_load        0          0          1          null       0          1
regb_load        0          0          0          null       0          0
regc_load        0          1          0          null       1          0
regd_load        0          0          1          null       0          1

CLK              6                7                8                9                10               11
mux_selecta      null             100              011              null             100              011
mux_selectb      null             00               00               null             00               00
mux_twos_select  1                1                1                1                1                1
rega_out         2-D*K0*K1        2-D*K0*K1        2-D*K0*K1*K2     2-D*K0*K1*K2     2-D*K0*K1*K2     2-D*K0*K1*K2*K3
regb_out         0                0                0                0                0                0
regc_out         N*K0*K1          N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2*K3    N*K0*K1*K2*K3
regd_out         D*K0*K1          D*K0*K1          D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2*K3
rega_load        null             0                1                null             0                1
regb_load        null             0                0                null             0                0
regc_load        null             1                0                null             1                0
regd_load        null             0                1                null             0                1

CLK              12                  13                  14
mux_selecta      null                100                 011
mux_selectb      null                00                  00
mux_twos_select  1                   1                   1
rega_out         2-D*K0*K1*K2*K3     2-D*K0*K1*K2*K3     2-D*K0*K1*K2*K3*K4
regb_out         0                   0                   0
regc_out         N*K0*K1*K2*K3       N*K0*K1*K2*K3*K4    N*K0*K1*K2*K3*K4
regd_out         D*K0*K1*K2*K3       D*K0*K1*K2*K3       D*K0*K1*K2*K3*K4
rega_load        null                0                   1
regb_load        null                0                   0
regc_load        null                1                   0
regd_load        null                0                   1

3.2 State Table for Square Root

After every three iterations, the 16-bit output from the 2s complement module is shifted by one bit before it propagates to the mux, which performs the division by two that the square root iteration requires.

CLK              0          1          2          3          4          5
mux_selecta      111        111        110        null       100        011
mux_selectb      11         11         11         null       00         00
mux_twos_select  0          0          0          0          0          0
rega_out         0          0          3-K*IA     3-K*IA     3-D*IA     3-D*K0*K1
regb_out         0          0          0          0          0          0
regc_out         0          N*IA       N*IA       N*IA       N*K0*K1    N*K0*K1
regd_out         0          0          D*IA       D*IA       D*IA       D*K0*K1
rega_load        0          0          1          null       0          1
regb_load        0          0          0          null       0          0
regc_load        0          1          0          null       1          0
regd_load        0          0          1          null       0          1

CLK              6                7                8                9                10               11
mux_selecta      null             100              011              null             100              011
mux_selectb      null             00               00               null             00               00
mux_twos_select  0                0                0                0                0                0
rega_out         3-D*K0*K1        3-D*K0*K1        3-D*K0*K1*K2     3-D*K0*K1*K2     3-D*K0*K1*K2     3-D*K0*K1*K2*K3
regb_out         0                0                0                0                0                0
regc_out         N*K0*K1          N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2*K3    N*K0*K1*K2*K3
regd_out         D*K0*K1          D*K0*K1          D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2*K3
rega_load        null             0                1                null             0                1
regb_load        null             0                0                null             0                0
regc_load        null             1                0                null             1                0
regd_load        null             0                1                null             0                1

CLK              12                  13                  14
mux_selecta      null                100                 011
mux_selectb      null                00                  00
mux_twos_select  0                   0                   0
rega_out         3-D*K0*K1*K2*K3     3-D*K0*K1*K2*K3     3-D*K0*K1*K2*K3*K4
regb_out         0                   0                   0
regc_out         N*K0*K1*K2*K3       N*K0*K1*K2*K3*K4    N*K0*K1*K2*K3*K4
regd_out         D*K0*K1*K2*K3       D*K0*K1*K2*K3       D*K0*K1*K2*K3*K4
rega_load        null                0                   1
regb_load        null                0                   0
regc_load        null                1                   0
regd_load        null                0                   1

3 Error Analysis


1.1 Error Analysis for Division

The analysis for division and square root was performed using the Excel spreadsheets provided in the course.

1.

N = 1.923    D = 1.523    IA = 0.75    N/D = 1.262639527

q*K          r*K          2-D*Xi       TRUE         Error        #bits
1.44225      1.14225      0.85775      1.262639527  0.179610473  -2.477056622
1.237089938  0.979764938  1.020235063  1.262639527  0.02554959   -5.290556064
1.26212253   0.999590542  1.000409458  1.262639527  0.000516998  -10.91755495
1.262639316  0.999999832  1.000000168  1.262639527  2.11689E-07  -22.17155272
1.262639527  1            1            1.262639527  3.59712E-14  -44.66015
1.262639527  1            1            1.262639527  6.66134E-16  -50.4150375


2.

N = 1.923    D = 1.013    IA = 0.75    N/D = 1.898321816

q*K          r*K          2-D*Xi       TRUE         Error        #bits
1.44225      0.75975      1.24025      1.898321816  0.456071816  -1.132667075
1.788750563  0.942279938  1.057720063  1.898321816  0.109571254  -3.190058739
1.891997357  0.996668394  1.003331606  1.898321816  0.00632446   -7.304842067
1.898300746  0.9999889    1.0000111    1.898321816  2.10706E-05  -15.53440872
1.898321816  1            1            1.898321816  2.33875E-10  -31.99354105
1.898321816  1            1            1.898321816  2.22045E-16  -52

3.

N = -1.012    D = 1.9123    IA = 0.75    N/D = -0.529205669

q*K           r*K          2-D*Xi       TRUE          Error        #bits
-0.759        1.434225     0.565775     -0.529205669  0.229794331  -2.121584885
-0.429423225  0.811448649  1.188551351  -0.529205669  0.099782444  -3.32507019
-0.510391554  0.964448388  1.035551612  -0.529205669  0.018814115  -5.7320408
-0.528536796  0.998736083  1.001263917  -0.529205669  0.000668872  -10.54598202
-0.529204823  0.999998403  1.000001597  -0.529205669  8.45399E-07  -20.17386446
-0.529205669  1            1            -0.529205669  1.35048E-12  -39.4296699

4.

N = 1.012    D = 1.9123    IA = 0.75    N/D = 0.529205669

q*K          r*K          2-D*Xi       TRUE         Error        #bits
0.759        1.434225     0.565775     0.529205669  0.229794331  -2.121584885
0.429423225  0.811448649  1.188551351  0.529205669  0.099782444  -3.32507019
0.510391554  0.964448388  1.035551612  0.529205669  0.018814115  -5.7320408
0.528536796  0.998736083  1.001263917  0.529205669  0.000668872  -10.54598202
0.529204823  0.999998403  1.000001597  0.529205669  8.45399E-07  -20.17386446
0.529205669  1            1            0.529205669  1.35048E-12  -39.4296699


In all the above cases, the error fell below 2^-16 (more than 16 bits of accuracy) after five iterations, showing that the design should be accurate to the number of bits used in the design.
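The #bits column in the tables above is the base-2 logarithm of the absolute error at each row. A minimal Python sketch of that calculation follows; it works in double-precision floating point, as a spreadsheet would, and the function name and the `rows` parameter are hypothetical. The last digits of late rows may differ from the spreadsheet because the error there is close to the floating-point limit.

```python
import math

def error_bits(n: float, d: float, ia: float = 0.75, rows: int = 6) -> list:
    """Return log2(|q - N/D|) for each row of the division iteration."""
    true_q = n / d
    q, r = n * ia, d * ia                 # first row: both terms scaled by IA
    bits = []
    for _ in range(rows):
        err = abs(q - true_q)
        bits.append(math.log2(err) if err > 0 else float("-inf"))
        k = 2.0 - r                       # factor for the next row
        q, r = q * k, r * k
    return bits

cases = [(1.923, 1.523), (1.923, 1.013), (-1.012, 1.9123), (1.012, 1.9123)]
fifth_row = [error_bits(n, d)[4] for n, d in cases]   # error after five rows
```

Every entry of `fifth_row` is below -16, which is the observation the paragraph above makes about the four cases.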

1.2 Error Analysis for Square Root

1.

N = 1.231    D = 1.231    IA = 0.853553391    N/D = 1

q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.050724224  0.896849224  1.051575388  1.109504394  0.05878017   -4.088526657
1.104915733  0.991745555  1.004127223  1.109504394  0.00458866   -7.767711238
1.109475967  0.999948757  1.000025621  1.109504394  2.84273E-05  -15.10236562
1.109504393  0.999999998  1.000000001  1.109504394  1.09252E-09  -29.76969629
1.109504394  1            1            1.109504394  2.22045E-16  -52
1.109504394  1            1            1.109504394  2.22045E-16  -52

2.

N = 1.99    D = 1.99    IA = 0.853553391    N/D = 1

q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.698571247  1.449821247  0.775089376  1.410673598  0.287897649  -1.796372086
1.316544529  0.870999747  1.064500127  1.410673598  0.094129069  -3.409215861
1.401461818  0.986982526  1.006508737  1.410673598  0.00921178   -6.762304257
1.410583564  0.999872358  1.000063821  1.410673598  9.00338E-05  -13.43917395
1.410673589  0.999999988  1.000000006  1.410673598  8.61919E-09  -26.78980037
1.410673598  1            1            1.410673598  2.22045E-16  -52

3.

N = 1.01    D = 1.01    IA = 0.853553391    N/D = 1

q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
0.862088924  0.735838924  1.132080538  1.004987562  0.142898638  -2.806935933
0.975954093  0.943055834  1.028472083  1.004987562  0.029033469  -5.106139236
1.003741539  0.997521859  1.00123907   1.004987562  0.001246023  -9.6484538
1.004985246  0.99999539   1.000002305  1.004987562  2.31634E-06  -18.7197191
1.004987562  1            1            1.004987562  8.00848E-12  -36.86160819
1.004987562  1            1            1.004987562  2.22045E-16  -52

4.

N = 1.7532    D = 1.7532    IA = 0.853553391    N/D = 1

q*K          r*K          (3-D*Xi)/2   TRUE         Error        #bits
1.496449804  1.277299804  0.861350098  1.324084589  0.172365215  -2.536459442
1.288967185  0.947659369  1.026170316  1.324084589  0.035117404  -4.831669987
1.322699864  0.997909496  1.001045252  1.324084589  0.001384726  -9.496183876
1.324082418  0.99999672   1.00000164   1.324084589  2.17146E-06  -18.81290355
1.324084589  1            1            1.324084589  5.34195E-12  -37.44577091
1.324084589  1            1            1.324084589  2.22045E-16  -52

After five iterations the number of bits of error exceeds 16 bits in all of the simulations, showing that five iterations is sufficient to produce enough accuracy.
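The square root columns above (q*K, r*K, and (3-D*Xi)/2) can be reproduced with a short floating-point sketch. It is an illustration of the arithmetic in the spreadsheet, not of the 16-bit hardware: it uses the IA value shown in the spreadsheet rows (0.853553391), and the function name and the `iterations` parameter are hypothetical. Here q tracks the square root while r = x*IA^2 is driven toward 1.

```python
import math

def goldschmidt_sqrt(x: float, ia: float = 0.853553391,
                     iterations: int = 5) -> float:
    """Approximate sqrt(x); `iterations` counts spreadsheet rows,
    including the first row that applies the initial approximation."""
    q = x * ia                # converges to sqrt(x)
    r = x * ia * ia           # converges to 1
    for _ in range(iterations - 1):
        k = (3.0 - r) / 2.0   # the (3-D*Xi)/2 column
        q *= k
        r *= k * k            # r is multiplied by the factor squared
    return q

inputs = [1.231, 1.99, 1.01, 1.7532]
errors = [abs(goldschmidt_sqrt(x) - math.sqrt(x)) for x in inputs]
```

For all four inputs the error after five rows is below 2^-16, in line with the conclusion drawn from the tables.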


4 APPENDICES

4.1 Specialized Full Adder


4.2 Carry Save Array Multiplier

4.3 Subtractor


4.4 Overall Project Design

4.5 MUXa


4. CONCLUSION

The designed hardware can be implemented in a digital system that requires division and square root functions and needs accuracy within 16 bits. This module can also be implemented with a small amount of area, especially when a multiplication module is already present, since the division module can be built upon it. The module also provides a drastic speedup for division and square root over programming those functions manually in a system. To further speed up processing, the design has been pipelined, increasing the possible throughput of the design. In the error analysis, the final iteration of every case reached at least 39 bits of accuracy for division and 52 bits for square root. The state tables provide the process flow with the corresponding registers, the mux select values, and the mathematical operations occurring after each iteration. The multiplier registers are internal to the module to help provide the pipeline capability for the unit. We essentially stall the module after each iteration to ensure that multiplications do not overlap each other.

