Implementation of Goldschmidt’s algorithm for 16-bit division and square root



    ECEN 4233: Implementation of Goldschmidt’s Algorithm for 16-bit Division and Square Root

    Cale Spratt and Jeremy Storm, Oklahoma State University

    Abstract: Division and square root are common functions in digital systems such as microprocessors. Microprocessors with dedicated hardware for division and square root can perform those functions in fewer clock cycles and with better power efficiency. This implementation uses a 36-bit multiplication unit to implement the Goldschmidt algorithm with enough precision to produce accurate 16-bit results, and it uses a two-stage pipeline to increase the efficiency of the implementation.

    Index Terms: Goldschmidt’s Algorithm, Carry Propagate Adder, Full Adder, Half Adder, Carry Save Array Multiplier

    1 INTRODUCTION

    In this implementation, the designed hardware can perform these operations:

    a) Division

    b) Square Root

    c) Multiplication

    Division and square root are common mathematical functions. Because of their importance, it is beneficial to create an implementation that runs efficiently and produces accurate, reliable results on every use. To decrease the number of clock cycles required to perform a division or square root, the design has been pipelined, allowing two different divisions or square roots to be in progress at the same time.

    Implementing Goldschmidt’s algorithm requires a multiplier, so with the addition of a few control lines the designed hardware can also perform multiplication, which makes the module more versatile. Most applications that use any division method depend on the precision that the hardware delivers. For our implementation we assumed a lower-precision format with a sign bit, an integer bit, and 14 bits of precision, for values in [1,2) and (-2,-1]. Had the design called for a larger input range, we would have expanded the precision and used a modified approximation scheme to keep the number of iterations effective. If a division is requested, the approximation muxes select 0.75 or -0.75 according to the sign; otherwise a square root is assumed and the value is 0.833. Selecting the correct approximation is essential for the division to converge properly. The value must always fall to the left of the quotient when on an exponentially decreasing graph. Our module uses the control muxes, A and B, to select the registers or input values that are propagated through the multiplier. Specific output values are mapped to specific registers to create the pipelining process.

    2 DIVISION

    2.1 Goldschmidt Method

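    The Goldschmidt method is an iterative method for computing the quotient Q = N/D. Both the numerator and the denominator are multiplied by a sequence of correction factors chosen to drive the denominator toward 1, so that the numerator converges to the quotient. Each iteration gives a better approximation than the last; convergence is quadratic, so the number of accurate bits roughly doubles with each iteration. The equation for determining the quotient is

    $$Q = \frac{N}{D} \approx N \cdot K_0 \cdot K_1 \cdots K_n, \qquad K_{i+1} = 2 - D \cdot K_0 \cdot K_1 \cdots K_i$$

    where K0 is the initial approximation (IA). K1 is obtained from the initial approximation, which is provided by the mux unit according to the mathematical operation requested.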

    The number of iterations needed is equal to (n/log_2(radix)), where n is the number of bits of precision.
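The iteration described in this section can be sketched in a few lines of Python. This is a floating-point illustration only, not the 16-bit hardware datapath; the function name `goldschmidt_divide` and the `steps` parameter are assumptions made for the example, and the default initial approximation of 0.75 is the division constant quoted in this paper.

```python
def goldschmidt_divide(n, d, ia=0.75, steps=4):
    """Approximate n / d by Goldschmidt iteration (floating-point sketch)."""
    q = n * ia          # running numerator, N * K0
    r = d * ia          # running denominator, D * K0
    for _ in range(steps):
        k = 2.0 - r     # next correction factor, 2 - D*K0*...*Ki
        q *= k          # numerator converges toward N / D
        r *= k          # denominator converges toward 1
    return q


print(goldschmidt_divide(1.923, 1.523))
```

For a negative divisor the initial approximation would be -0.75, so that the running denominator still starts near 1.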

  • 2 HIGH SPEED PROJECT SPRAT, STORM

    3 HARDWARE MODULES

    A. Initial Approximation

    The initial approximation (IA) module consists of three 16-bit mux units, cascaded to select among signed approximations according to the mathematical function requested. The input values lie in [1,2) or (-2,-1]. There are only three constant approximations: 0.75, -0.75, and 0.833. The first two are used for signed divisions, and 0.833 is the approximation used for a square root operation. A 2-bit selection input is provided to the cascaded muxes to select the approximation for the requested operation. Area is 96 gates. Delay is 6.
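As an illustration of the number format, the three constants can be encoded as 16-bit words with a sign bit, an integer bit, and 14 fractional bits. The sketch below assumes a two's-complement encoding and round-to-nearest, neither of which is stated explicitly in the text; the helper names `to_fixed` and `from_fixed` are hypothetical.

```python
FRAC_BITS = 14          # fractional bits, per the format described in the introduction
WORD_MASK = 0xFFFF      # 16-bit word


def to_fixed(x):
    """Encode x as a 16-bit two's-complement word with 14 fractional bits."""
    return round(x * (1 << FRAC_BITS)) & WORD_MASK


def from_fixed(w):
    """Decode a 16-bit word back to a float."""
    if w & 0x8000:      # sign bit set: the word holds a negative value
        w -= 1 << 16
    return w / (1 << FRAC_BITS)


for value in (0.75, -0.75, 0.833):
    print(value, hex(to_fixed(value)))
```

Under these assumptions 0.833 is not exactly representable, so the stored constant differs from it by less than one unit in the last place (2^-14).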

    B. Multiplexers

    We have a total of three multiplexor sets that assist in implementing the expected mathematical operation for each iteration of the pipeline stages. Multiplexor A selects among the 16-bit values from the initial approximation unit, the inputs N and D, and the two register units C and D. A 3-bit mux selection input determines the 16-bit mux output value that is propagated to our multiplier. All values passed to the multiplier through the muxes are extended with two additional bits for higher precision during multiplication. The table below lists the selection codes for multiplexor A. Area is 288 gates. Delay is 9.

    Mux Select (3 bits)    Output (16 bits)
    000                    ---
    001                    Regc
    010                    ---
    011                    D
    100                    Regd
    101                    IA
    110                    ---
    111                    N

    Multiplexor B selects a 16-bit output value from the initial approximation, register A, and register B. This mux provides the second multiplication operand for our CSAM multiplier. The logic table for multiplexor B is listed below. Area is 96 gates. Delay is 6.

    Mux Select (2 bits)    Output (16 bits)
    00                     Rega
    01                     Regb
    10                     Rega
    11                     IA
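The two select tables above can be modeled as simple lookup maps. The sketch below is illustrative only: the `mux` helper and the `sources` dictionary are hypothetical names, values are plain floats rather than 16-bit words, and unused select codes are represented as `None`.

```python
# Select-to-source maps transcribed from the tables for multiplexors A and B.
# None marks a select code that the table lists as unused.
MUX_A = {0b000: None, 0b001: "regc", 0b010: None, 0b011: "D",
         0b100: "regd", 0b101: "IA", 0b110: None, 0b111: "N"}
MUX_B = {0b00: "rega", 0b01: "regb", 0b10: "rega", 0b11: "IA"}


def mux(table, select, sources):
    """Return the value routed by `select`, or None for an unused code."""
    name = table[select]
    return None if name is None else sources[name]


# Example operand values (hypothetical) and cleared registers.
sources = {"N": 1.923, "D": 1.523, "IA": 0.75,
           "rega": 0.0, "regb": 0.0, "regc": 0.0, "regd": 0.0}
a = mux(MUX_A, 0b111, sources)   # routes N
b = mux(MUX_B, 0b11, sources)    # routes IA
print(a * b)                     # first product fed to the multiplier, N * IA
```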

    Multiplexor Twos is devoted to propagating either a 16-bit two or a 16-bit three to the 2s complement module in order to perform either a division or a square root. The table for the logic is provided below. Area is 96 gates. Delay is 6.

    Mux Select (1 bit)     Output (16 bits)
    0                      16'h4000
    1                      16'h6000


    The multiplexor for the 2s complement logic unit has an area of 48 gates and a delay of 3, based on the layout of the unit.

    C. Signed Carry Save Array Multiplier (CSAM)

    The Carry Save Array Multiplier (CSAM) is pipelined so that the external register units can store the appropriate multiplication value. Our multiplier was instantiated as a 36-bit unit with two 18-bit inputs from the pre-module muxes. The multiplier gives one sign bit and one integer bit, with 16 bits of precision. Three 18-bit register modules that incorporate latch control were placed between the multiplication array and the Carry Propagate Adders (CPAs). This is required to provide the expected pipelining capability. These registers store the carry and sum values for propagation into the CPAs. The implementation was developed to use 16-bit external registers, so all inputs and outputs have to be extended or truncated to meet the processing requirements. The two most-significant bits are always removed after the CPA sum is calculated. In other words, we have performed the same function as a rounding unit would. Area is 2433 gates. Delay is 37.

    D. 2s Complement Module

    The complement subtractor is integrated to support both the division and the square root functions. An input mux determines the constant used for the intended mathematical operation. For the division operation, the 2s complement logic uses a 16-bit value of 2, supplied as the output of the 2s selection mux. A 16-bit value of 3 is used for a square root operation. Our 2s complement module has 16-bit inputs and outputs, but they have to be modified to include one additional integer bit for the multiplier value being used. We concatenate one bit after the left-most bit of the multiplier value and one bit to the right-most bit of the subtractor value. After the subtraction is complete, the additional integer bit is removed before propagating the output. Area is 144 gates for 16 bits. Delay is 55.
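The subtraction from a constant can be sketched as an addition of the two's complement, which is the usual way such a unit is built. The widths below are assumptions chosen for illustration (a 17-bit datapath, that is, the 16-bit word plus one extra integer bit, with 14 fractional bits); the function name `subtract_from` is hypothetical and the exact bit alignment of the hardware is not reproduced here.

```python
FRAC_BITS = 14
WIDTH = 17                      # assumed: 16-bit word plus one extra integer bit
MASK = (1 << WIDTH) - 1


def subtract_from(constant, x):
    """Compute constant - x as constant + (~x + 1), modulo 2**WIDTH."""
    negated = (~x + 1) & MASK   # two's complement of x
    return (constant + negated) & MASK


TWO = 2 << FRAC_BITS            # 2.0 in this format (division)
THREE = 3 << FRAC_BITS          # 3.0 in this format (square root)
x = 3 << (FRAC_BITS - 2)        # 0.75 in this format

print(subtract_from(TWO, x) / (1 << FRAC_BITS))
print(subtract_from(THREE, x) / (1 << FRAC_BITS))
```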

    E. Total Area and Delay

    The total area for the entire block is the sum of the areas of the initial approximation unit, the two input muxes, the square root/division mux, the multiplier, and the 2s complement logic unit. This area comes out to 3105 gates. The delay is totaled in the same way and comes out to roughly 122.
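The area total can be checked by adding the per-module gate counts quoted in this section. The tally below assumes that the square root/division mux contributes the 48-gate figure given for the 2s complement select multiplexor, since that is the value that reproduces the stated total; the dictionary keys are descriptive labels, not signal names from the design.

```python
# Gate counts as quoted in the module descriptions in this section.
area = {
    "initial approximation": 96,
    "multiplexor A": 288,
    "multiplexor B": 96,
    "2s complement select mux": 48,   # assumed: the 48-gate figure, not the 96-gate one
    "CSAM multiplier": 2433,
    "2s complement module": 144,
}

total_area = sum(area.values())
print(total_area)
```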


    3. State Table

    3.1 State Table for Division

    CLK               0          1          2          3          4          5
    mux_selecta       111        111        110        null       100        011
    mux_selectb       11         11         11         null       00         00
    mux_twos_select   1          1          1          1          1          1
    rega_out          0          0          2-K*IA     2-K*IA     2-D*IA     2-D*K0*K1
    regb_out          0          0          0          0          0          0
    regc_out          0          N*IA       N*IA       N*IA       N*K0*K1    N*K0*K1
    regd_out          0          0          D*IA       D*IA       D*IA       D*K0*K1
    rega_load         0          0          1          null       0          1
    regb_load         0          0          0          null       0          0
    regc_load         0          1          0          null       1          0
    regd_load         0          0          1          null       0          1

    CLK               6                7                8                9                10               11
    mux_selecta       null             100              011              null             100              011
    mux_selectb       null             00               00               null             00               00
    mux_twos_select   1                1                1                1                1                1
    rega_out          2-D*K0*K1        2-D*K0*K1        2-D*K0*K1*K2     2-D*K0*K1*K2     2-D*K0*K1*K2     2-D*K0*K1*K2*K3
    regb_out          0                0                0                0                0                0
    regc_out          N*K0*K1          N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2*K3    N*K0*K1*K2*K3
    regd_out          D*K0*K1          D*K0*K1          D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2*K3
    rega_load         null             0                1                null             0                1
    regb_load         null             0                0                null             0                0
    regc_load         null             1                0                null             1                0
    regd_load         null             0                1                null             0                1

    CLK               12                  13                  14
    mux_selecta       null                100                 011
    mux_selectb       null                00                  00
    mux_twos_select   1                   1                   1
    rega_out          2-D*K0*K1*K2*K3     2-D*K0*K1*K2*K3     2-D*K0*K1*K2*K3*K4
    regb_out          0                   0                   0
    regc_out          N*K0*K1*K2*K3       N*K0*K1*K2*K3*K4    N*K0*K1*K2*K3*K4
    regd_out          D*K0*K1*K2*K3       D*K0*K1*K2*K3       D*K0*K1*K2*K3*K4
    rega_load         null                0                   1
    regb_load         null                0                   0
    regc_load         null                1                   0
    regd_load         null                0                   1

    3.2 State Table for Square Root

    After every three iterations, the 16-bit output from the 2s complement module is shifted by one bit before propagating to the mux, which produces the division by two that a square root requires.

    CLK               0          1          2          3          4          5
    mux_selecta       111        111        110        null       100        011
    mux_selectb       11         11         11         null       00         00
    mux_twos_select   0          0          0          0          0          0
    rega_out          0          0          3-K*IA     3-K*IA     3-D*IA     3-D*K0*K1
    regb_out          0          0          0          0          0          0
    regc_out          0          N*IA       N*IA       N*IA       N*K0*K1    N*K0*K1
    regd_out          0          0          D*IA       D*IA       D*IA       D*K0*K1
    rega_load         0          0          1          null       0          1
    regb_load         0          0          0          null       0          0
    regc_load         0          1          0          null       1          0
    regd_load         0          0          1          null       0          1

    CLK               6                7                8                9                10               11
    mux_selecta       null             100              011              null             100              011
    mux_selectb       null             00               00               null             00               00
    mux_twos_select   0                0                0                0                0                0
    rega_out          3-D*K0*K1        3-D*K0*K1        3-D*K0*K1*K2     3-D*K0*K1*K2     3-D*K0*K1*K2     3-D*K0*K1*K2*K3
    regb_out          0                0                0                0                0                0
    regc_out          N*K0*K1          N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2       N*K0*K1*K2*K3    N*K0*K1*K2*K3
    regd_out          D*K0*K1          D*K0*K1          D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2       D*K0*K1*K2*K3
    rega_load         null             0                1                null             0                1
    regb_load         null             0                0                null             0                0
    regc_load         null             1                0                null             1                0
    regd_load         null             0                1                null             0                1

    CLK               12                  13                  14
    mux_selecta       null                100                 011
    mux_selectb       null                00                  00
    mux_twos_select   0                   0                   0
    rega_out          3-D*K0*K1*K2*K3     3-D*K0*K1*K2*K3     3-D*K0*K1*K2*K3*K4
    regb_out          0                   0                   0
    regc_out          N*K0*K1*K2*K3       N*K0*K1*K2*K3*K4    N*K0*K1*K2*K3*K4
    regd_out          D*K0*K1*K2*K3       D*K0*K1*K2*K3       D*K0*K1*K2*K3*K4
    rega_load         null                0                   1
    regb_load         null                0                   0
    regc_load         null                1                   0
    regd_load         null                0                   1


    1 ERROR ANALYSIS

    1.1 Error Analysis for Division

    The analysis for division and square root was performed using the Excel spreadsheets provided in the course.

    1. N = 1.923, D = 1.523, IA = 0.75, N/D = 1.262639527

    q*K           r*K           2-D*Xi        TRUE          Error         #bits
    1.44225       1.14225       0.85775       1.262639527   0.179610473   -2.477056622
    1.237089938   0.979764938   1.020235063   1.262639527   0.02554959    -5.290556064
    1.26212253    0.999590542   1.000409458   1.262639527   0.000516998   -10.91755495
    1.262639316   0.999999832   1.000000168   1.262639527   2.11689E-07   -22.17155272
    1.262639527   1             1             1.262639527   3.59712E-14   -44.66015
    1.262639527   1             1             1.262639527   6.66134E-16   -50.4150375

    2. N = 1.923, D = 1.013, IA = 0.75, N/D = 1.898321816

    q*K           r*K           2-D*Xi        TRUE          Error         #bits
    1.44225       0.75975       1.24025       1.898321816   0.456071816   -1.132667075
    1.788750563   0.942279938   1.057720063   1.898321816   0.109571254   -3.190058739
    1.891997357   0.996668394   1.003331606   1.898321816   0.00632446    -7.304842067
    1.898300746   0.9999889     1.0000111     1.898321816   2.10706E-05   -15.53440872
    1.898321816   1             1             1.898321816   2.33875E-10   -31.99354105
    1.898321816   1             1             1.898321816   2.22045E-16   -52

    3. N = -1.012, D = 1.9123, IA = 0.75, N/D = -0.529205669

    q*K           r*K           2-D*Xi        TRUE          Error         #bits
    -0.759        1.434225      0.565775      -0.529205669  0.229794331   -2.121584885
    -0.429423225  0.811448649   1.188551351   -0.529205669  0.099782444   -3.32507019
    -0.510391554  0.964448388   1.035551612   -0.529205669  0.018814115   -5.7320408
    -0.528536796  0.998736083   1.001263917   -0.529205669  0.000668872   -10.54598202
    -0.529204823  0.999998403   1.000001597   -0.529205669  8.45399E-07   -20.17386446
    -0.529205669  1             1             -0.529205669  1.35048E-12   -39.4296699

    4. N = 1.012, D = 1.9123, IA = 0.75, N/D = 0.529205669

    q*K           r*K           2-D*Xi        TRUE          Error         #bits
    0.759         1.434225      0.565775      0.529205669   0.229794331   -2.121584885
    0.429423225   0.811448649   1.188551351   0.529205669   0.099782444   -3.32507019
    0.510391554   0.964448388   1.035551612   0.529205669   0.018814115   -5.7320408
    0.528536796   0.998736083   1.001263917   0.529205669   0.000668872   -10.54598202
    0.529204823   0.999998403   1.000001597   0.529205669   8.45399E-07   -20.17386446
    0.529205669   1             1             0.529205669   1.35048E-12   -39.4296699


    In all of the above cases, the error was smaller than 2^-16 (more than 16 accurate bits) after five iterations, showing that the design should be accurate to the number of bits used in the design.
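The "#bits" column in the division cases is the base-2 logarithm of the absolute error. A short floating-point sketch that recomputes the first rows of the first case is given below; the function name `error_bits` and the `rows` parameter are assumptions for the example, and only the first few rows are computed because later rows reach the limit of double precision, where the logarithm of a zero error would be undefined.

```python
import math


def error_bits(n, d, ia, rows=3):
    """Return log2(|q - n/d|) for the first `rows` Goldschmidt estimates."""
    true_q = n / d
    q, r = n * ia, d * ia            # first row of the table: N*IA and D*IA
    bits = []
    for _ in range(rows):
        bits.append(math.log2(abs(q - true_q)))
        k = 2.0 - r                  # the 2-D*Xi column
        q, r = q * k, r * k          # next row of the table
    return bits


print(error_bits(1.923, 1.523, 0.75))
```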

    1.2 Error Analysis for Square Root

    1. N = 1.231, D = 1.231, IA = 0.853553391, N/D = 1

    q*K           r*K           (3-D*Xi)/2    TRUE          Error         #bits
    1.050724224   0.896849224   1.051575388   1.109504394   0.05878017    -4.088526657
    1.104915733   0.991745555   1.004127223   1.109504394   0.00458866    -7.767711238
    1.109475967   0.999948757   1.000025621   1.109504394   2.84273E-05   -15.10236562
    1.109504393   0.999999998   1.000000001   1.109504394   1.09252E-09   -29.76969629
    1.109504394   1             1             1.109504394   2.22045E-16   -52
    1.109504394   1             1             1.109504394   2.22045E-16   -52

    2. N = 1.99, D = 1.99, IA = 0.853553391, N/D = 1

    q*K           r*K           (3-D*Xi)/2    TRUE          Error         #bits
    1.698571247   1.449821247   0.775089376   1.410673598   0.287897649   -1.796372086
    1.316544529   0.870999747   1.064500127   1.410673598   0.094129069   -3.409215861
    1.401461818   0.986982526   1.006508737   1.410673598   0.00921178    -6.762304257
    1.410583564   0.999872358   1.000063821   1.410673598   9.00338E-05   -13.43917395
    1.410673589   0.999999988   1.000000006   1.410673598   8.61919E-09   -26.78980037
    1.410673598   1             1             1.410673598   2.22045E-16   -52

    3. N = 1.01, D = 1.01, IA = 0.853553391, N/D = 1

    q*K           r*K           (3-D*Xi)/2    TRUE          Error         #bits
    0.862088924   0.735838924   1.132080538   1.004987562   0.142898638   -2.806935933
    0.975954093   0.943055834   1.028472083   1.004987562   0.029033469   -5.106139236
    1.003741539   0.997521859   1.00123907    1.004987562   0.001246023   -9.6484538
    1.004985246   0.99999539    1.000002305   1.004987562   2.31634E-06   -18.7197191
    1.004987562   1             1             1.004987562   8.00848E-12   -36.86160819
    1.004987562   1             1             1.004987562   2.22045E-16   -52

    4. N = 1.7532, D = 1.7532, IA = 0.853553391, N/D = 1

    q*K           r*K           (3-D*Xi)/2    TRUE          Error         #bits
    1.496449804   1.277299804   0.861350098   1.324084589   0.172365215   -2.536459442
    1.288967185   0.947659369   1.026170316   1.324084589   0.035117404   -4.831669987
    1.322699864   0.997909496   1.001045252   1.324084589   0.001384726   -9.496183876
    1.324082418   0.99999672    1.00000164    1.324084589   2.17146E-06   -18.81290355
    1.324084589   1             1             1.324084589   5.34195E-12   -37.44577091
    1.324084589   1             1             1.324084589   2.22045E-16   -52

    After five iterations the error is smaller than 2^-16 (more than 16 accurate bits) in all of the simulations, showing that five iterations are sufficient to produce enough accuracy.
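The spreadsheet columns for the square root cases are consistent with the recurrence sketched below, in which the estimate is multiplied by K = (3 - r)/2 and the residual r by K squared on each step. This is a floating-point reading of the tabulated numbers, not a description of the hardware datapath; the function name `goldschmidt_sqrt` and the `steps` parameter are assumptions, and the default initial approximation is the value used in the spreadsheet cases.

```python
import math


def goldschmidt_sqrt(n, ia=0.853553391, steps=4):
    """Approximate sqrt(n) with the recurrence read off the square root tables."""
    q = n * ia              # first q*K entry; converges toward sqrt(n)
    r = n * ia * ia         # first r*K entry; converges toward 1
    for _ in range(steps):
        k = (3.0 - r) / 2.0 # the (3-D*Xi)/2 column
        q *= k
        r *= k * k
    return q


for n in (1.231, 1.99, 1.01, 1.7532):
    print(n, goldschmidt_sqrt(n), math.sqrt(n))
```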


    4 APPENDICES

    4.1 Specialized Full Adder


    4.2 Carry Save Array Multiplier

    4.3 Subtractor


    4.4 Overall Project Design

    4.5 MUXa


    4. CONCLUSION

    The designed hardware can be used in a digital system that requires division and square root functions and needs accuracy within 16 bits. The module can also be implemented in a small area, especially when a multiplication module is already present, since the division module can be built upon it. The module also provides a drastic speedup for division and square root compared with programming those functions manually in software. To further speed up processing, the design has been pipelined, increasing its possible throughput. The error analysis implies that we should retain 39 bits for a division operation and 52 fractional bits for a square root. The state tables give the process flow with the corresponding registers, the mux select values, and the mathematical operations occurring after each iteration. The multiplier registers are internal to the module to provide the pipeline capability for the unit. We stall the module after each iteration to ensure that successive multiplications do not overrun each other.
