    Published in IET Computers & Digital Techniques

    Received on 27th March 2009

    Revised on 27th September 2009

    doi: 10.1049/iet-cdt.2009.0036

    ISSN 1751-8601

    Field programmable gate array prototyping of end-around carry parallel prefix tree architectures

    F. Liu1, Q. Tan1, G. Chen2, X. Song3, O. Ait Mohamed4, M. Gu5

    1 National Lab of Parallel Distributed Processing, Hunan, China

    2 Lingcore Lab, Portland, OR, USA

    3 ECE Department, Portland State University, Portland, OR, USA

    4 ECE Department, Concordia University, Montreal, Quebec, Canada

    5 School of Software, TsingHua University, Beijing, China

    E-mail: [email protected]

    Abstract: As an important part of many processors' floating-point units, the fused multiply-add unit performs a multiplication followed immediately by an addition. A fast 128-bit floating-point end-around-carry (EAC) adder was proposed for the fused multiply-add unit of the IBM POWER6 microprocessor. Very few algorithmic details about this adder exist in today's literature. In this study, a completely designed EAC adder that can work independently as a regular adder is proposed, and the details of its arithmetic algorithms are described. In IBM's original EAC adder, the Kogge–Stone tree was chosen for its high performance on ASIC technology. In this study, the authors present a comparative study of different parallel prefix trees used in the design of the new EAC adder targeting field programmable gate array (FPGA) technology. The study highlights the main performance differences among 14 different architecture configurations, focusing on the area requirements and the critical path delay. The experimental results show that there is one architecture configuration that combines a lower area requirement with higher performance.

    1 Introduction

    The fused multiply-add unit plays an important role in modern microprocessors. It performs a floating-point multiplication followed immediately by an addition of the product with a third floating-point operand. In 2007, a seven-cycle fused multiply-add pipeline unit was proposed [1] as part of the floating-point unit in IBM's POWER6 microprocessor. In this fused multiply-add dataflow, the product must be aligned before it is added to the addend. Because the magnitude of the product is unknown in the early stages prior to the combination with the addend, it is difficult to determine a priori which operand is bigger [2]. Even if it were determined early that the product was bigger, there would be a problem in conditionally complementing two intermediate operands, the carry and sum outputs of the counter tree. Thus, an adder needs to be designed that always outputs a positive magnitude result and preferably needs to conditionally complement only one operand [2]. Therefore a new 128-bit end-around carry (EAC) adder was designed and fabricated in IBM's fused multiply-add unit [3]. The intention was not to produce an adder with the best stand-alone performance but to provide the one with the best overall floating-point performance [3].

    IBM implemented its EAC adder in a 65 nm SOI technology [4], and some sub-components are implemented using the Kogge–Stone tree [5]. In fact, the Ladner–Fischer tree [6] was used in IBM's first-pass test chip. Compared with the Ladner–Fischer design, the Kogge–Stone design is about 0.5 FO4 faster with only 6% area overhead and a 5% power increase [3]. Therefore the Kogge–Stone tree was chosen in the final design. Besides the Kogge–Stone and Ladner–Fischer trees, many other variations of parallel prefix trees are known [7]. The motivation for our work was to find the best EAC adder for use in a fused multiply-add unit.

    306 IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 4, pp. 306–316

    © The Institution of Engineering and Technology 2010 doi: 10.1049/iet-cdt.2009.0036

    www.ietdl.org

    We also note that field programmable gate array (FPGA) technology has recently enjoyed rapidly increasing popularity. In the nanotechnology era, the logic density of FPGAs has increased dramatically. The fixed structure and large variety of resources of an FPGA have the potential to affect the implementation results significantly. It is therefore interesting to check whether the EAC adder works well on FPGA technology and to study the performance differences among different architecture configurations, focusing on the area requirements and the critical path delay.

    Since it would be difficult to evaluate the full floating-point performance, in this paper we propose a completely designed EAC adder that can work independently without being a part of the fused multiply-add unit. Very few descriptions of the EAC adder's formulations exist in today's literature; therefore the details of the proposed EAC adder's arithmetic algorithms are explained. Because the algorithms of our EAC adder mainly follow the arithmetic algorithms of IBM's EAC adder and can be read without knowledge of the whole floating-point unit, we believe that our description will help readers to better understand the nature of the EAC design.

    To make our EAC adder work as a regular adder, we design some new logic units such as the input logic unit, the sign logic unit and so on. This design makes it easier to implement and test other design choices. Moreover, because the additional logic units do not affect the EAC adder's key behaviours, evaluations of our EAC adder's different designs are relevant to fused multiply-add unit design. We study the performance of the EAC adder with different parallel prefix trees on FPGA technology. The experimental results show that there is one architecture configuration that combines a lower area requirement with higher performance.

    The paper is organised as follows. In Section 2, the related works are reviewed. In Section 3, some preliminaries and the algorithms of the 32-bit adder block are presented. Section 4 describes the architecture of our proposed 128-bit EAC adder and its arithmetic algorithms. Section 5 explains the implementation of different parallel prefix trees in our EAC adder and reports the simulation results. Section 6 concludes this paper.

    2 Related work

    In the past few years, several adders used in the fused multiply-add operation have been proposed [8–10]. These adder schemes are based on the delay profile of the multiply compression tree. As a result, they are power efficient only when the final addition is performed right after the compression tree and when the EAC computation is not needed [3]. For higher floating-point performance, the EAC adder is used in recent processors. Although the EAC adder has become a common hardware design practice, this technique has not been well documented. Shedletsky [11] analysed some behaviours of the EAC adder using real circuit examples. Yu et al. [3] proposed a fast 128-bit EAC adder which was fabricated as part of the IBM POWER6 microprocessor. They described the adder's architecture and analysed its performance and power dissipation. Zhang et al. [12] presented a 108-bit EAC adder which is also used by a fused multiply-add unit. Structure-aware layout techniques were used to optimise their adder's structure. All the works above focused on the EAC adder's architecture design, while the details of its arithmetic algorithms were not explained. Schwarz [2] discussed some aspects of the EAC adder's algorithms, but some details were still not included.

    On the other hand, the parallel prefix tree has recently been used as a subcomponent of the EAC adder. Many classic parallel prefix adders have been proposed, including Sklansky [13], Kogge–Stone and Brent–Kung [14]. These prefix networks achieve three extreme goals: minimal logic levels and wire tracks, minimal max-fanout and logic levels, and minimal wire tracks and max-fanout, respectively. In addition, Ladner–Fischer, Han–Carlson [15] and Knowles [7] implemented the trade-offs between each pair of the extreme cases. The structure of the prefix network determines the type of the prefix adder. Ziegler et al. [16] considered sparsity, fanout and radix as three dimensions in the design space of regular parallel prefix adders and presented a unified formalism to describe such structures. Liu et al. [17] studied how to find optimal prefix structures for specific applications and proposed an integer linear programming method to build minimal-power prefix adders within given timing and area constraints. For the EAC adder of the IBM POWER6, chip tests showed that the Kogge–Stone tree was a better choice than the Ladner–Fischer tree. The works discussed above are based on ASIC technology. Vitoroulis et al. [18] investigated the performance of parallel prefix adders implemented with FPGA technology and reported the area requirements and critical path delay for a variety of classical parallel prefix adder structures. However, the parallel prefix trees were implemented as single adders, without being a part of bigger designs. In our work, we try to answer these questions: What are the arithmetic algorithms of an EAC adder with a parallel prefix tree? If we use different parallel prefix trees in the EAC adder on FPGA technology, which one is better? How do parallel prefix trees affect the other parts of the EAC adder? As a part of the EAC adder, should the implementation of the parallel prefix tree itself be changed?

    3 Thirty-two-bit adder block in EAC adder

    In this paper, the symbols 0 and 1 denote Boolean false and true, or the digits zero and one, respectively; the symbol $\wedge$ denotes the Boolean AND; $\vee$ (or $+$) denotes the Boolean OR; $\oplus$ denotes the Boolean Exclusive OR. A binary number of length $n$ ($n \ge 1$) is an ordered sequence of binary bits where each bit can assume one of the values 0 or 1. For a traditional integer adder, we use $x = (x_{n-1} x_{n-2}, \ldots, x_1 x_0)$, $y = (y_{n-1} y_{n-2}, \ldots, y_1 y_0)$ to denote the two $n$-bit addends and $s = (s_{n-1} s_{n-2}, \ldots, s_0)$ to denote the corresponding sum ($n \ge 1$); $x_i$, $y_i$, $s_i$ denote the binary bits of $x$, $y$, $s$ at position $i$, where $0 \le i \le n-1$. Let $c = \{c_n, c_{n-1}, \ldots, c_0\}$ be the corresponding set of carries, where $c_0$ is the initial incoming carry, $c_i$ denotes the carry from bit position $i-1$ and $c_n$ is the outgoing carry.

    To explain the adder's algorithm, some standard notions such as propagated carry, generated carry, group-propagated carry and group-generated carry should be introduced. These notions are related to parallel prefix trees and their definitions can be found in Koren [19]. In this paper, we use $P_i = x_i \oplus y_i$, $G_i = x_i \wedge y_i$ (for simplicity, $G_i = x_i y_i$) to denote the propagated carry and generated carry at bit position $i$, respectively. We use $P_{i:j}$, $G_{i:j}$ to denote the group-propagated carry and group-generated carry for the bit positions $i, i-1, \ldots, j$, respectively.
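The propagate and generate notions above can be made concrete with a small software model. The following is a minimal sketch, not part of the original design: the function names `bit_pg`, `group_pg` and `carry_into` are hypothetical, and the group signals are folded serially for clarity, whereas a hardware prefix tree computes them in parallel.

```python
def bit_pg(x: int, y: int, n: int):
    """Per-bit propagate P_i = x_i XOR y_i and generate G_i = x_i AND y_i."""
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]
    return p, g


def group_pg(p, g, i: int, j: int):
    """Return (P_{i:j}, G_{i:j}) for bit positions i, i-1, ..., j (i >= j)."""
    gp, gg = 1, 0
    for k in range(j, i + 1):      # fold upwards from bit j to bit i
        gg = g[k] | (p[k] & gg)    # G_{k:j} = G_k + P_k * G_{k-1:j}
        gp = p[k] & gp             # P_{k:j} = P_k * P_{k-1:j}
    return gp, gg


def carry_into(x: int, y: int, n: int, i: int, c0: int = 0) -> int:
    """Carry c_i into bit position i: c_i = G_{i-1:0} + P_{i-1:0} * c_0."""
    if i == 0:
        return c0
    p, g = bit_pg(x, y, n)
    gp, gg = group_pg(p, g, i - 1, 0)
    return gg | (gp & c0)
```

Calling `carry_into(x, y, n, n, c0)` gives the outgoing carry $c_n$, which can be checked against ordinary integer addition.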

    The notion of the carry select adder is also important. For the group that consists of $k$ bit positions starting with bit position $j$ and ending with bit position $i$, where $i = j + k - 1$, the outputs of the carry select adder are the sum bits $s_i, s_{i-1}, \ldots, s_j$ and the outgoing carry $c_{i+1}$. These outputs can be selected by the incoming carry into this group, $c_j$, as follows

    $$
    \begin{aligned}
    c_{i+1} &= [c^0_{i+1} \wedge \overline{c_j}] \vee [c^1_{i+1} \wedge c_j] \\
    s_m &= [s^0_m \wedge \overline{c_j}] \vee [s^1_m \wedge c_j] \quad (m = j, j+1, \ldots, i)
    \end{aligned}
    \tag{1}
    $$

    $\overline{c_j}$ is the Boolean complement code of $c_j$; $s^0_m$ is the sum bit at bit position $m$ under the condition that the incoming carry is 0 and $c^0_{i+1}$ is the corresponding outgoing carry; $s^1_m$, $c^1_{i+1}$ are the sum bit at position $m$ and the outgoing carry under the condition that the incoming carry into the group is 1. Other useful notions and formulations about parallel prefix trees and the carry select adder can be found in Koren [19].
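Equation (1) can be exercised with a short behavioural model. This is a sketch under stated assumptions: the names `carry_select_group` and `carry_select_add` are hypothetical, and the two conditional results are obtained with ordinary integer addition rather than with prefix trees.

```python
def carry_select_group(x: int, y: int, k: int, c_in: int):
    """One k-bit carry-select group following equation (1)."""
    mask = (1 << k) - 1
    t0 = (x & mask) + (y & mask)     # result assuming incoming carry 0
    t1 = t0 + 1                      # result assuming incoming carry 1
    s0, c0 = t0 & mask, t0 >> k      # conditional sum and carry, carry-in 0
    s1, c1 = t1 & mask, t1 >> k      # conditional sum and carry, carry-in 1
    not_c = c_in ^ 1
    # c_{i+1} = (c^0 AND NOT c_j) OR (c^1 AND c_j)
    c_out = (c0 & not_c) | (c1 & c_in)
    # s_m = (s^0_m AND NOT c_j) OR (s^1_m AND c_j), applied to all k bits at once
    sel = mask if c_in else 0
    s = (s0 & ~sel & mask) | (s1 & sel)
    return s, c_out


def carry_select_add(x: int, y: int, n: int = 32, k: int = 8, c_in: int = 0):
    """Chain n/k groups; each group's carry selects the next group's outputs."""
    s, c = 0, c_in
    for j in range(0, n, k):
        sj, c = carry_select_group(x >> j, y >> j, k, c)
        s |= sj << j
    return s, c
```

With the default parameters the model mirrors the 32-bit block built from four 8-bit groups that is described in this section.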

    Since the carry signal is on the critical path, a 128-bit EAC adder was designed in the IBM POWER6 microprocessor to obtain a high-performance floating-point unit. Fig. 1 shows its block diagram.

    This adder is divided into three sub-blocks: the 32-bit adder block, the EAC logic block and the final sum selection block [3]. Each 32-bit adder block is also partitioned into three sub-components. The first sub-component is an 8-bit prefix-2 Kogge–Stone tree with sparseness of 2 that generates 8-bit propagates as well as conditional sums that are needed later for sum selection. The second sub-component is a prefix-2 Kogge–Stone tree with sparseness of 8 that generates 32-bit propagated terms as well as 32-bit conditional sums. The last sub-component is a sum selection block [3].

    From Fig. 1 we know that the 128-bit EAC adder is composed of four 32-bit adder blocks. The architecture of each 32-bit adder block is shown in Fig. 2. Each 32-bit adder block is actually a carry select adder consisting of four 8-bit adder blocks. Each 8-bit adder block has the structure depicted in Fig. 3.

    In each 8-bit adder block, there are two 8-bit adders which are implemented using a parallel prefix tree. In IBM's design, they are implemented using an 8-bit Kogge–Stone tree. The real structure of each 8-bit adder block is a conditional sum adder.

    In fact, there are two levels of parallel prefix trees in the 32-bit adder block, as Fig. 2 shows. The first level is the 8-bit parallel prefix tree with sparseness of 2 that generates 8-bit carry signals, propagate terms as well as conditional sums. The second level is the parallel prefix tree with sparseness of 8 that generates 32-bit carry signals, propagate terms and conditional sums. In Fig. 2, we show just one implementation of the parallel prefix tree for the second level.

    Figure 1 Block diagram of the 128-bit binary adder [3]

    Figure 2 Block diagram of 32-bit adder block

    4 Complete EAC adder design

    Although the EAC adder has been implemented in several microprocessors, very few details on its formulations and arithmetic algorithms can be found in today's literature. Schwarz [2] gave good explanations of some aspects of the EAC adder's algorithms, but some details were not included. In this section, we try to describe the details of the EAC adder's algorithms clearly. We propose a completely designed EAC adder and describe its architecture. The new architecture makes our EAC adder independent, without being a part of the fused multiply-add unit. Our new design mainly follows the algorithms of the EAC adder implemented in the IBM POWER6 microprocessor. The additional logic units of our EAC adder ensure that the whole adder can work independently. They do not affect the key algorithms. Therefore we take our EAC design as the example to explain the EAC adder's arithmetic algorithms, which makes our descriptions clearer and easier to read. Readers can understand them without knowing other details of the IBM POWER6's floating-point unit. Another advantage is that our new design is easy to implement and test, which makes it possible to implement different architecture configurations and compare their properties such as performance.

    Fig. 4 shows the architecture of the proposed EAC adder. In this adder, the inputs are two 129-bit binary addends $x = (\mathrm{s}\, x_{127} x_{126}, \ldots, x_0)$, $y = (\mathrm{s}\, y_{127} y_{126}, \ldots, y_0)$ and the output is the sum $s = (\mathrm{s}\, s_{127} s_{126}, \ldots, s_0)$. They are all in sign magnitude format. $x.x$, $y.y$, $s.s$ are the magnitudes of $x$, $y$, $s$ and $x.\mathrm{s}$, $y.\mathrm{s}$, $s.\mathrm{s}$ are the corresponding sign bits. The magnitudes of the operands are used to produce the positive magnitude of the sum and the sign bits of the operands are used to produce the sign of the sum. The adder in Fig. 4 can implement four operations: $x.x + y.y$, $x.x - y.y$, $(-x.x) + y.y$ and $(-x.x) + (-y.y)$.

    Figure 3 Eight-bit adder block

    Figure 4 Architecture of modified EAC adder

    4.1 Integrating addition and subtraction

    EAC means that when subtracting two signed numbers that are in sign magnitude format, the subtraction is implemented by the addition of the first operand with the Boolean complement code of the second operand. For this addition, instead of setting a carry into the least significant digit, the carry out of the most significant digit is taken as the carry into the least significant digit. This ensures that the result of the addition is always a positive magnitude and that preferably only one operand needs to be conditionally complemented. The EAC adder is designed to form a unique sum for every possible pair of addends. When adding, it behaves like other regular adders. When subtracting, it uses the end-around carry to ensure that the result is always positive.

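    Hence, with EAC, the adder shown in Fig. 4 should satisfy the following constraints: (1) when $x.\mathrm{s} = y.\mathrm{s}$, the adder should do addition and we have the sign $s.\mathrm{s} = x.\mathrm{s}$ and the magnitude $s.s = x.x + y.y$. (2) when $x.\mathrm{s} \ne y.\mathrm{s}$, the adder should do subtraction. If $x.x \ge y.y$, then $s.\mathrm{s} = x.\mathrm{s}$ and $s.s = x.x - y.y$; if $x.x < y.y$, then $s.\mathrm{s} = y.\mathrm{s}$ and $s.s = y.y - x.x$.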

    Fig. 5 shows the subtraction dataflow of our EAC adder. The algorithm is described as follows:

    1. Decide which one is bigger between $x.x$ and $y.y$ by performing an effective subtraction $x.x - y.y$. If $x.x - y.y \ge 0$, then $x.x \ge y.y$, otherwise $x.x < y.y$. We use $\overline{y.y}$ to denote the Boolean complement code of $y.y$. Since $x.x - y.y = x.x + \overline{y.y} + 1 = x.x + 2^n - y.y$ (the first equality holding modulo $2^n$), we have the property: when $x.x \ge y.y$, the outgoing carry of $x.x + \overline{y.y} + 1$ will be 1. Therefore the outgoing carry of $x.x + \overline{y.y} + 1$, which is denoted by $c_{out}$, can be used to decide whether $x.x$ is bigger than $y.y$. If $c_{out} = 1$, then $x.x \ge y.y$; if $c_{out} = 0$, then $x.x < y.y$.

    2. Do the addition $x.x + \overline{y.y} + c_{out}$ and compute the Boolean complement code of the sum, which gives $\overline{x.x + \overline{y.y} + c_{out}}$. When $x.x \ge y.y$, which means $c_{out} = 1$, the subtraction $x.x - y.y$ can be rewritten as $s.s = x.x - y.y = x.x + \overline{y.y} + 1 = x.x + \overline{y.y} + c_{out}$. When $x.x < y.y$, which means $c_{out} = 0$, the subtraction $y.y - x.x$ can be rewritten as follows

    $$
    \begin{aligned}
    s.s = y.y - x.x &= -(x.x - y.y) = -(x.x + \overline{y.y} + 1) = -(x.x + \overline{y.y}) - 1 \\
    &= \overline{(x.x + \overline{y.y} + 0)} + 1 - 1 = \overline{(x.x + \overline{y.y} + 0)}
    \end{aligned}
    \tag{2}
    $$

    With the above equation we obtain the following property: when $x.x < y.y$, the output of the EAC adder is defined by the following equation

    $$
    s.s = \overline{x.x + \overline{y.y} + c_{out}} \tag{3}
    $$

    3. Finally, the outgoing carry $c_{out}$ is used to select the correct $s.s$. When $x.x \ge y.y$, the output of the EAC adder should be $s.s = x.x + \overline{y.y} + c_{out}$; when $x.x < y.y$, the output of the EAC adder should be $s.s = \overline{x.x + \overline{y.y} + c_{out}}$.

    After discussing how to implement the effective subtraction of the operands $x.x$ and $y.y$, we focus on their addition. It is easy to implement $x.x + y.y$ by itself. However, we must combine the addition with the subtraction in one single adder. Fig. 6 shows how to integrate them.
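The three steps of the subtraction dataflow can be traced with a behavioural sketch. The function name `eac_subtract` and the parameter names are assumptions made for illustration; the model uses two explicit additions, exactly as in Fig. 5, and does not yet merge them into one.

```python
def eac_subtract(x_mag: int, y_mag: int, n: int = 128):
    """Return (|x_mag - y_mag|, c_out) following the three-step EAC dataflow."""
    mask = (1 << n) - 1
    y_bar = ~y_mag & mask                      # Boolean complement code of y
    # Step 1: the carry out of x + ~y + 1 tells whether x >= y
    c_out = (x_mag + y_bar + 1) >> n
    # Step 2: add again with the end-around carry as the incoming carry
    total = (x_mag + y_bar + c_out) & mask
    # Step 3: keep the sum when c_out = 1, otherwise complement it (equation (3))
    s_mag = total if c_out else (~total & mask)
    return s_mag, c_out
```

For 8-bit operands, `eac_subtract(5, 3, 8)` and `eac_subtract(3, 5, 8)` both yield the magnitude 2, with `c_out` equal to 1 and 0 respectively, so the carry identifies the larger operand without a separate comparison.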

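    In Fig. 6, the add/sub-logic unit takes $x.\mathrm{s}$, $y.\mathrm{s}$ as the inputs and $os$ as the output. The output $os$ is defined by

    $$
    os = x.\mathrm{s} \oplus y.\mathrm{s} \tag{4}
    $$

    The input logic unit takes $os$, $y.y$ as the inputs and $yt$ as the output. The output $yt$ is defined by

    $$
    yt = \begin{cases} y.y, & os = 0 \\ \overline{y.y}, & os = 1 \end{cases} \tag{5}
    $$

    The sign logic unit takes $x.\mathrm{s}$, $y.\mathrm{s}$, $c_{out}$ as the inputs and $s.\mathrm{s}$ as the output. The output $s.\mathrm{s}$ is calculated by

    $$
    s.\mathrm{s} = (x.\mathrm{s} \wedge c_{out}) \vee (y.\mathrm{s} \wedge \overline{c_{out}}) \tag{6}
    $$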

    Figure 5 Subtraction dataflow of EAC adder

    Figure 6 Integration of addition and subtraction

    In Fig. 6, when $os = 0$, we can use the EAC adder to do the addition $x + y$; when $os = 1$, we use the EAC adder to perform the subtraction as shown in Fig. 5.

    The inputs of the EAC adder are $yt$, $x.x$, $os$; the outputs are $c_{out}$, $s.s$. When $os = 0$, because $yt = y.y$, the inputs are actually $x.x$, $y.y$ and the incoming carry 0; the outputs should be the sum $s.s = x.x + y.y$ and the outgoing carry $c_{out}$.

    When $os = 1$, the inputs are $yt = \overline{y.y}$, $x.x$ and the incoming carry 1; the outputs should be the correct result computed by the algorithm in Fig. 5. In this way, we perform both the addition and the subtraction using a single adder. We can use another logic unit, named the EAC logic unit, to implement this method.
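The integration of addition and subtraction described in this subsection can be sketched as a sign-magnitude reference model. The function and argument names (`sign_magnitude_add`, `x_sign`, `x_mag` and so on) are hypothetical; the add/sub-logic, input logic and sign logic follow equations (4) to (6), while the adder core is modelled with plain integer arithmetic rather than the prefix-tree hardware.

```python
def sign_magnitude_add(x_sign: int, x_mag: int, y_sign: int, y_mag: int, n: int = 128):
    """Return (s_sign, s_mag, c_out) for two sign-magnitude operands."""
    mask = (1 << n) - 1
    os = x_sign ^ y_sign                        # add/sub-logic unit, equation (4)
    yt = (~y_mag & mask) if os else y_mag       # input logic unit, equation (5)
    first = x_mag + yt + os                     # incoming carry is 1 only when subtracting
    c_out = first >> n
    if os:
        total = (x_mag + yt + c_out) & mask     # second addition with end-around carry
        s_mag = total if c_out else (~total & mask)
    else:
        s_mag = first & mask                    # plain addition; c_out is the overflow carry
    s_sign = (x_sign & c_out) | (y_sign & (c_out ^ 1))   # sign logic unit, equation (6)
    return s_sign, s_mag, c_out
```

For example, with 8-bit magnitudes, `sign_magnitude_add(0, 5, 1, 9, 8)` models $5 + (-9)$ and returns sign 1 and magnitude 4. When the operands cancel exactly, the model returns magnitude 0 with the sign of $x$, which is a property of equation (6) as written here.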

    4.2 EAC logic unit

    Fig. 6 shows the way to combine the addition with the subtraction. However, the effective subtraction needs two addition operations, as shown in Fig. 5. With the help of the EAC logic unit, we can implement both the addition and the subtraction with only one addition operation. Fig. 4 shows the design.

    In Fig. 4, the input logic unit, sign logic unit and add/sub-logic unit are similar to those in Fig. 6. Each of the four 32-bit adder blocks is the 32-bit adder block shown in Fig. 2. In fact, the real addends of our EAC adder in Fig. 4 are the two 128-bit binary numbers $x.x$ and $yt$. They are divided into four groups as the inputs to the four 32-bit adder blocks. The four 32-bit adder blocks output the group-propagated carries $P_{127:96}, \ldots, P_{31:0}$ and the group-generated carries $G_{127:96}, \ldots, G_{31:0}$, which are used by the EAC logic unit; they also output the conditional sums $s^0_{127:0}$, $s^1_{127:0}$, which are used by the last sum selection unit. $s^0_{127:0}$ is the sum under the condition that the incoming carry is 0, while $s^1_{127:0}$ is the sum under the condition that the incoming carry is 1.

    In Fig. 4, the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}, P_{31:0}$ are used by both the R logic unit and the EAC logic unit. The R logic unit takes $P_{31:0}$, $os$ as the inputs and $P^t_{31:0}$ as the output. $P^t_{31:0}$ is computed by

    $$
    P^t_{31:0} = \begin{cases} P_{31:0}, & os = 1 \\ 0, & os = 0 \end{cases} \tag{7}
    $$

    The EAC logic unit takes the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}$ together with $P^t_{31:0}$ as the inputs to calculate the incoming carries into each group $c_0, c_1, c_2, c_3$. With the help of the above logic units, the algorithm of the EAC adder is as follows:

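    1. When $x.\mathrm{s} \ne y.\mathrm{s}$, we have $os = 1$, $yt = \overline{y.y}$, $P^t_{31:0} = P_{31:0}$. From the formulation of the carry lookahead adder, we can obtain the group carries $c'_0, c'_1, c'_2, c'_3$ (written with a prime to distinguish them from the outputs $c_0, c_1, c_2, c_3$ of the EAC logic unit)

    $$
    \begin{aligned}
    c'_1 &= G_{31:0} + P_{31:0} c_{in} \\
    c'_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} c_{in} \\
    c'_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} c_{in} \\
    c'_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} \\
    &\quad + P_{127:96} P_{95:64} P_{63:32} P_{31:0} c_{in}
    \end{aligned}
    \tag{8}
    $$

    Following the first step of Fig. 5, we know that $x.x + \overline{y.y} + 1$ should be done and the outgoing carry $c_{out}$ should be used to decide whether $x.x$ is bigger than $y.y$ or not. Thus, by the above equations, assuming $c_{in} = 1$, $c_{out}$ can be computed as

    $$
    \begin{aligned}
    c_{out} = c'_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} \\
    &\quad + P_{127:96} P_{95:64} P_{63:32} P_{31:0}
    \end{aligned}
    \tag{9}
    $$

    Then, for the second addition, which means $x.x + \overline{y.y} + c_{out}$, we take $c_{out} = c'_0$ as the incoming carry. Using the formulations of the carry lookahead adder again, we can obtain the group carry signals as

    $$
    \begin{aligned}
    c'_1 &= G_{31:0} + P_{31:0} c_{out} \\
    &= G_{31:0} + P_{31:0} \{ G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \} \\
    &= G_{31:0} + P_{31:0} G_{127:96} + P_{127:96} P_{31:0} G_{95:64} + P_{127:96} P_{95:64} P_{31:0} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
    c'_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} c_{out} \\
    &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} G_{127:96} + P_{127:96} P_{63:32} P_{31:0} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
    c'_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} c_{out} \\
    &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
    c'_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} c_{out} \\
    &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0}
    \end{aligned}
    \tag{10}
    $$

    $c'_0, c'_1, c'_2, c'_3$ can be used to select the correct sum $x.x + \overline{y.y} + c_{out} = sum_{127:0}$. In the following, we will show how the EAC logic unit completes the task mentioned above.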


    Definition 4.1 (EAC logic unit): The EAC logic unit takes the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}, P^t_{31:0}$ as the inputs and $c_0, c_1, c_2, c_3$ as the outputs. The outputs are defined as follows

    $$
    \begin{aligned}
    c_1 &= G_{31:0} + P^t_{31:0} G_{127:96} + P^t_{31:0} P_{127:96} G_{95:64} + P^t_{31:0} P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P^t_{31:0} P_{63:32} G_{127:96} + P^t_{31:0} P_{127:96} P_{63:32} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P^t_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0}
    \end{aligned}
    \tag{11}
    $$

    As we know, when $x.\mathrm{s} \ne y.\mathrm{s}$, we have $os = 1$ and $P^t_{31:0} = P_{31:0}$. So, for the EAC logic unit, the equations for calculating $c_0, c_1, c_2, c_3$ can be rewritten as

    $$
    \begin{aligned}
    c_1 &= G_{31:0} + P_{31:0} G_{127:96} + P_{31:0} P_{127:96} G_{95:64} + P_{31:0} P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
    c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{31:0} P_{63:32} G_{127:96} + P_{31:0} P_{127:96} P_{63:32} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
    c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
    c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0}
    \end{aligned}
    \tag{12}
    $$

    In this case, it is easy to see that the equations for calculating $c_0, c_1, c_2, c_3$ are equivalent to the group carries derived in equation (10). Therefore the EAC logic unit can be used to implement the subtraction dataflow shown in Fig. 5 with only one addition. Furthermore, $os$ and $c_0$ can be used to select the correct sum $s.s$ from $sum_{127:0}$ and $\overline{sum_{127:0}}$ according to the following rules:

    When $x.x \ge y.y$, we have $os = 1$, $c_0 = 1$, $c_{out} = c_0 = 1$. As a result, $sum_{127:0} = x.x + \overline{y.y} + 1$, and the sum is selected as $s.s = sum_{127:0} = x.x + \overline{y.y} + 1$.

    When $x.x < y.y$, we have $os = 1$, $c_0 = c_{out} = 0$ and $sum_{127:0} = x.x + \overline{y.y} + 0$. Then, the sum is $s.s = \overline{sum_{127:0}} = \overline{x.x + \overline{y.y} + 0}$.

    2. When $x.\mathrm{s} = y.\mathrm{s}$, we should do the addition $x.x + y.y$. Taking the formulations of the carry lookahead adder and assuming the incoming carry $c_{in} = 0$, the group carries can be calculated as follows

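    $$
    \begin{aligned}
    c''_1 &= G_{31:0} + P_{31:0} c_{in} = G_{31:0} \\
    c''_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} c_{in} = G_{63:32} + P_{63:32} G_{31:0} \\
    c''_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} c_{in} \\
    &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} \\
    c''_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} c_{in} \\
    &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0}
    \end{aligned}
    \tag{13}
    $$

    With $c_{in} = 0$, $c''_1$, $c''_2$, $c''_3$, we can select the correct sum $x.x + y.y$ from the outputs $s^0_{127:0}$ and $s^1_{127:0}$, and $c''_0$ is the outgoing carry.

    On the other hand, because $x.\mathrm{s} = y.\mathrm{s}$, we have $os = 0$, $yt = y.y$, $P^t_{31:0} = 0$, and the EAC logic unit's formulations can be rewritten as follows

    $$
    \begin{aligned}
    c_1 &= G_{31:0} + P^t_{31:0} G_{127:96} + P^t_{31:0} P_{127:96} G_{95:64} + P^t_{31:0} P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    &= G_{31:0} + 0 \cdot G_{127:96} + 0 \cdot P_{127:96} G_{95:64} + 0 \cdot P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} \cdot 0 \\
    &= G_{31:0} \\
    c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P^t_{31:0} P_{63:32} G_{127:96} + P^t_{31:0} P_{127:96} P_{63:32} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    &= G_{63:32} + P_{63:32} G_{31:0} \\
    c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P^t_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} \\
    c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
    &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0}
    \end{aligned}
    \tag{14}
    $$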

    We can see that the group carries obtained in equation (13) are the same as the EAC logic unit outputs $c_0, c_1, c_2, c_3$ in equation (14). Therefore $c_1$, $c_2$, $c_3$ can be used to select the correct sum. Here, the first group is a special case. $sum_{31:0}$ is controlled not only by $c_0$ but also by $os$

    $$
    sum_{31:0} = \begin{cases} s^1_{31:0}, & c_0 \wedge os = 1 \\ s^0_{31:0}, & \text{otherwise} \end{cases} \tag{15}
    $$

    In this way, when $x.\mathrm{s} = y.\mathrm{s}$, whatever the value of $c_0$ is, we always have $sum_{127:0} = x.x + y.y$. The EAC logic unit can implement the simple addition $x.x + y.y$ correctly. Furthermore, $os$ and $c_0$ can also be used to select the correct sum in the subtraction dataflow discussed above.

    In this way, the end-around carry logic unit can combine the addition and subtraction correctly by doing only one addition operation. In [2], the formulation of the EAC adder is similar, but some details of the algorithms were not explained, and the EAC logic unit is introduced as a part of the fused multiply-add unit. This means it cannot do the addition independently. Our design given in Fig. 4 can perform the addition independently. So, it is easy to verify the correctness of the adder's algorithms.
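The single-addition scheme of this subsection can be checked with a behavioural model of the four 32-bit blocks, the R logic unit, the EAC logic unit and the sum selection. This is only a sketch: the names `block_signals`, `eac_logic` and `eac_add` are hypothetical, each block is modelled with integer arithmetic instead of the two-level prefix trees, and the lists `G` and `P` are indexed from the least significant block (index 0 is bits 31:0).

```python
W, GROUPS = 32, 4
N = W * GROUPS
BLOCK_MASK = (1 << W) - 1
MASK = (1 << N) - 1


def block_signals(a: int, b: int):
    """Group generate, group propagate and both conditional sums of one block."""
    t = a + b
    g = t >> W                          # block produces a carry with carry-in 0
    p = int((a ^ b) == BLOCK_MASK)      # every bit position propagates
    return g, p, t & BLOCK_MASK, (t + 1) & BLOCK_MASK


def eac_logic(G, P, os):
    """Group carries c0..c3 as in equation (11); Pt is the R logic unit, equation (7)."""
    Pt = P[0] & os
    G0, G1, G2, G3 = G
    _, P1, P2, P3 = P
    c1 = G0 | Pt & G3 | Pt & P3 & G2 | Pt & P3 & P2 & G1 | P3 & P2 & P1 & Pt
    c2 = G1 | P1 & G0 | Pt & P1 & G3 | Pt & P3 & P1 & G2 | P3 & P2 & P1 & Pt
    c3 = G2 | P2 & G1 | P2 & P1 & G0 | P2 & P1 & Pt & G3 | P3 & P2 & P1 & Pt
    c0 = G3 | P3 & G2 | P3 & P2 & G1 | P3 & P2 & P1 & G0 | P3 & P2 & P1 & Pt
    return c0, c1, c2, c3


def eac_add(x_mag: int, yt: int, os: int):
    """One-pass EAC addition of x_mag and yt; returns (magnitude, c0)."""
    sig = [block_signals((x_mag >> (W * k)) & BLOCK_MASK,
                         (yt >> (W * k)) & BLOCK_MASK) for k in range(GROUPS)]
    c0, c1, c2, c3 = eac_logic([s[0] for s in sig], [s[1] for s in sig], os)
    select = [c0 & os, c1, c2, c3]      # first group follows equation (15)
    total = 0
    for k in range(GROUPS):
        _, _, s0, s1 = sig[k]
        total |= (s1 if select[k] else s0) << (W * k)
    # when subtracting with c0 = 0, the complemented sum is the magnitude
    s_mag = (~total & MASK) if (os and not c0) else total
    return s_mag, c0
```

Passing `yt = ~y & MASK` with `os = 1` models the subtraction case, and passing `yt = y` with `os = 0` models the addition case, in which `c0` is the ordinary outgoing carry.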

    5 Implementation and validation

    From the arithmetic algorithms discussed above, we know that for the 32-bit adder block in IBM's EAC adder design, both the first-level and the second-level parallel prefix trees are Kogge–Stone trees. Compared with the Ladner–Fischer tree, the Kogge–Stone tree is a better choice on ASIC technology. Here, we try to find the best choice on FPGA technology. In this paper, our proposed EAC adder follows all the key algorithms of IBM's design; the additional logic units are mainly used to ensure that the EAC adder can work independently. Thus, it is useful not only for implementing and testing the EAC adder easily, but also as a reference for finding a better design for the EAC adder used in a fused multiply-add unit. We implement different parallel prefix tree architecture configurations in our EAC adder and report the simulation results.

    Knowles [7] has presented complete classes of regular fan-out prefix adders which are bounded at the extremes by the Kogge–Stone tree and the Ladner–Fischer tree. For our study, using FPGA technology, we choose the regular parallel prefix trees of Knowles's adder family and other basic parallel prefix trees to implement the first-level 8-bit parallel prefix tree depicted in Fig. 3. The chosen parallel prefix trees are Kogge–Stone; Ladner–Fischer; Brent–Kung; Han–Carlson; Knowles [1, 1, 4]; Knowles [1, 2, 2]; Knowles [1, 1, 2].

    Then, for the second-level parallel prefix trees in Fig. 2, we choose Knowles [1, 1] and Knowles [1, 2] from Knowles's adder family to implement them, respectively. These adders were selected because they span the design limits and intermediate cases in terms of area, depth of the prefix network, fan-out and interconnect count. The notions introduced in Section 3 are helpful for understanding how these parallel prefix trees work. However, we must change their regular implementation to ensure that they work correctly in the EAC adder.

    5.1 Parallel prefix tree design in EAC adder

    To implement a parallel prefix tree, we need half adders to calculate the generated carry and propagated carry at each bit position. Then, using these carry signals, we need some other cells to compute the group-generated carries and group-propagated carries. Fig. 7 shows some gate-level basic cells which calculate the group-propagated carry $P_{i:j}$ and group-generated carry $G_{i:j}$ in the intermediate stages of the parallel prefix tree. In Fig. 7, the quadrate cell calculates $P_{i:j}$ and $G_{i:j}$ simultaneously whereas the triangular cell calculates only $G_{i:j}$. Therefore the circuit of the quadrate cell is more complex than that of the triangular cell.
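The roles of the half adder and the two basic cells can be sketched in software. The gate-level circuits of Fig. 7 are not reproduced here; the functions below assume the standard prefix operator, and the names `half_adder`, `quadrate_cell`, `triangular_cell` and `kogge_stone` are chosen for illustration only. A Kogge–Stone arrangement of quadrate cells is included as one example of an intermediate-stage network.

```python
def half_adder(xi: int, yi: int):
    """Bit-level (G_i, P_i) = (x_i AND y_i, x_i XOR y_i)."""
    return xi & yi, xi ^ yi


def quadrate_cell(hi, lo):
    """Merge (G, P) of an upper group with (G, P) of the adjacent lower group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def triangular_cell(hi, g_lo):
    """Same merge, but only the group-generated carry is produced."""
    g_hi, p_hi = hi
    return g_hi | (p_hi & g_lo)


def kogge_stone(pg):
    """pg[i] = (G_i, P_i); returns [(G_{i:0}, P_{i:0})] for every bit position i."""
    n = len(pg)
    cur = list(pg)
    d = 1
    while d < n:
        nxt = list(cur)
        for i in range(d, n):
            nxt[i] = quadrate_cell(cur[i], cur[i - d])   # span doubles at each level
        cur = nxt
        d *= 2
    return cur
```

For two 8-bit addends, `kogge_stone([half_adder((x >> i) & 1, (y >> i) & 1) for i in range(8)])` returns the prefix signals of all eight positions in three levels. The triangular cell gives the same generate output as the quadrate cell while dropping the propagate output.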

    With the help of these basic cells, the rough implementation of a Brent–Kung tree is shown in Fig. 8. We use $\mathrm{HA}_i$ ($0 \le i \le 7$) to denote the half adders, and we do not take the buffers into account. For a regular parallel prefix adder that adds two addends,

    Figure 7 Basic cells in parallel prefix tree

    Figure 8 Eight-bit Brent–Kung tree


    we always assume that the incoming carry into the adder is $c_0 = 0$. For two $n$-bit binary addends $x = (x_{n-1}, x_{n-2}, \ldots, x_0)$ and $y = (y_{n-1}, y_{n-2}, \ldots, y_0)$, the carry and sum at bit position $i$ in a parallel prefix tree are computed as $c_i = G_{i-1:0} \lor (P_{i-1:0} \land c_0)$ and $s_i = P_i \oplus c_i$, where $0 \le i \le n-1$. Because $c_0 = 0$, we have $c_i = G_{i-1:0} \lor (P_{i-1:0} \land c_0) = G_{i-1:0}$. That is why two different basic cells from Fig. 7 can be used to build the regular Brent–Kung tree in Fig. 8: where only the signal $G_{i-1:0}$ is needed, the simpler triangular cell can be used to reduce complexity. Vitoroulis [18] compared the performance and area of regular parallel prefix trees implemented on FPGA technology. However, when the parallel prefix trees are implemented as components of our EAC adder in Fig. 4, they cannot be designed in the regular way shown in Fig. 8: both $G_{i:0}$ and $P_{i:0}$ must be kept as outputs for reuse in the next stage. For example, if we use a Brent–Kung tree as the component in the EAC adder, that is, if the parallel prefix tree in Fig. 3 is implemented as a Brent–Kung tree, we can use only the quadrate cell to calculate the signals in the intermediate stages. We must therefore change the regular design of the Brent–Kung tree shown in Fig. 8. Fig. 9 shows the rough architecture of the modified Brent–Kung tree that we adopted.
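As a rough illustration of a Brent–Kung tree built only from quadrate cells, the Python sketch below computes every $G_{i:0}$ and $P_{i:0}$ and then forms the sum for an arbitrary incoming carry. It follows the textbook Brent–Kung layout (an up-sweep followed by a down-sweep) and is an assumption about the general structure, not a transcription of Fig. 9; the names `brent_kung_gp` and `add_with_carry_in` are hypothetical, and `n` is assumed to be a power of two.

```python
def brent_kung_gp(x, y, n=8):
    """Return (p, G, P): bit propagates, and group signals G[i] = G_{i:0},
    P[i] = P_{i:0}, computed with quadrate (G and P) cells only."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]    # half-adder carry
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # half-adder sum
    G, P = g[:], p[:]

    def combine(i, j):
        # Quadrate cell: merge the group ending at i with the group ending at j.
        G[i], P[i] = G[i] | (P[i] & G[j]), P[i] & P[j]

    d = 1
    while d < n:                      # up-sweep: spans of 2, 4, 8, ... bits
        for i in range(2 * d - 1, n, 2 * d):
            combine(i, i - d)
        d *= 2
    d = n // 4
    while d >= 1:                     # down-sweep: fill in remaining prefixes
        for i in range(3 * d - 1, n, 2 * d):
            combine(i, i - d)
        d //= 2
    return p, G, P

def add_with_carry_in(x, y, c0, n=8):
    """Sum and carry-out using c_i = G_{i-1:0} OR (P_{i-1:0} AND c0)."""
    p, G, P = brent_kung_gp(x, y, n)
    carries = [c0] + [G[i - 1] | (P[i - 1] & c0) for i in range(1, n + 1)]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n]
```

With `c0 = 0` the `P` outputs are never consulted, which is why a regular adder can replace some cells with triangular ones. As soon as the incoming carry may be 1, every $P_{i:0}$ is needed, so all cells must be quadrate.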

    Therefore, on FPGA technology, properties of the different parallel prefix trees such as area and performance will differ from the results listed in Vitoroulis's report [18]. As a result, to implement different parallel prefix trees in our EAC adder, we should first change the implementation of the parallel prefix tree itself and then take into account the relationship between the parallel prefix trees and the other parts of the EAC adder.
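One way to see why the group-propagated carries matter to the rest of an EAC adder is the generic end-around-carry relation, in which the carry-out of the most significant bit is fed back as the carry-in of the least significant bit. The Python sketch below uses that textbook relation on a single-level, 8-bit adder. It is an assumption made for illustration and does not model the multi-level structure of Figs. 2 to 4; `prefix_gp` and `eac_add` are hypothetical names, and a simple serial scan stands in for whichever prefix tree is chosen.

```python
def prefix_gp(x, y, n):
    """Serial reference for the prefix outputs: G[i] = G_{i:0}, P[i] = P_{i:0}.
    Any correct parallel prefix tree produces the same values."""
    G, P = [], []
    for i in range(n):
        g = (x >> i) & (y >> i) & 1
        p = ((x >> i) ^ (y >> i)) & 1
        if i == 0:
            G.append(g)
            P.append(p)
        else:
            G.append(g | (p & G[-1]))
            P.append(p & P[-1])
    return G, P

def eac_add(x, y, n=8):
    """End-around-carry addition: the carry-out G_{n-1:0} re-enters at bit 0."""
    G, P = prefix_gp(x, y, n)
    c_in = G[n - 1]                       # end-around carry
    result = 0
    for i in range(n):
        # c_i = G_{i-1:0} OR (P_{i-1:0} AND c_in); P is needed whenever c_in = 1
        c_i = c_in if i == 0 else G[i - 1] | (P[i - 1] & c_in)
        p_i = ((x >> i) ^ (y >> i)) & 1
        result |= (p_i ^ c_i) << i
    return result
```

For example, `eac_add(200, 100)` gives 45: the plain sum 300 overflows eight bits, leaving 44, and the fed-back carry adds 1. The all-propagate input (for example `eac_add(0xFF, 0x00)`) is resolved here by taking the carry-out of the plain sum, which is one possible convention.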

    5.2 Experimental results

    In this section, we present the simulation, synthesis and implementation results. EAC adders were designed using the various tree structures discussed in this paper. They were first coded in VHDL at two different levels, and all 14 architecture configurations were then modelled in the Aldec Active-HDL simulation environment. The adder functionality was successfully verified using 100 000 random test vectors.
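Random-vector verification of this kind can be sketched in software as follows. This is a Python analogue written for illustration, not the VHDL test bench used in the study. The golden model `reference_eac`, the stand-in design under test `ripple_eac` and the harness `verify` are all hypothetical names, and the reference assumes the generic end-around-carry behaviour (carry-out added back into the least significant bit).

```python
import random

def reference_eac(x, y, n):
    """Golden model: add, then fold the carry-out back into bit 0."""
    t = x + y
    return (t + (t >> n)) & ((1 << n) - 1)

def ripple_eac(x, y, n):
    """Stand-in design under test: bit-serial adder run twice, the second
    time with the first pass's carry-out as carry-in."""
    def ripple(c):
        s = 0
        for i in range(n):
            a, b = (x >> i) & 1, (y >> i) & 1
            s |= (a ^ b ^ c) << i
            c = (a & b) | (c & (a ^ b))
        return s, c
    _, c_out = ripple(0)
    s, _ = ripple(c_out)
    return s

def verify(dut, n, vectors=100_000, seed=0):
    """Return the first failing (x, y) pair, or None if all vectors pass."""
    rng = random.Random(seed)          # fixed seed keeps runs reproducible
    for _ in range(vectors):
        x, y = rng.getrandbits(n), rng.getrandbits(n)
        if dut(x, y, n) != reference_eac(x, y, n):
            return x, y
    return None
```

Calling `verify(ripple_eac, 16, vectors=2000)` returns `None` for this stand-in, whereas a design that drops the end-around carry is reported with its first failing input pair. A bit-serial Python model is slow, so the vector count is best reduced when the word length is large.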

    After functional verification, all 14 adder architectures were implemented on a high-performance Xilinx Virtex-II Pro FPGA (XC2VP100) in the Xilinx ISE synthesis environment. We measured the area of an implemented design as the number of FPGA slices it occupies, and its speed as the delay of the longest signal path, that is, the critical path delay (ns). The area and speed results are compared in Figs. 10–13.

    These results show that the minimum area is achieved with the 32-bit Knowles [1, 1] tree and 8-bit Ladner–Fischer tree configuration, and the maximum area with the 32-bit Knowles [1, 1] tree and 8-bit Knowles [1, 1, 2] tree configuration (Fig. 13), which is 18% larger.

    By comparing the critical path delay results of the various EAC adders in Fig. 10, we find that the 32-bit Knowles [1, 2] tree and 8-bit Han–Carlson tree configuration has the lowest delay; the 32-bit Knowles [1, 1] tree and 8-bit

    Figure 9 Eight-bit Brent–Kung tree in EAC adder

    Figure 10 Critical path delay, logic delay and route delay (ns)

    Figure 11 Logic delay (ns)


    Ladner–Fischer tree configuration has the maximum delay, which is about 22.5% larger. The critical path delay has two main components: the logic delay and the routing delay. Fig. 10 shows that the routing delay exceeds the logic delay for all adders, with very little variation. The wiring (routing) is chosen automatically by the synthesis tools. It can sometimes be reduced in the final phase of a design, manually or by other methods, but sometimes this optimisation is very hard. Although the logic delay is always related to the routing delay, we still compare them separately in Figs. 11 and 12. The results show that the 32-bit Knowles [1, 2] tree and 8-bit Han–Carlson tree configuration also has the lowest logic delay, but not the lowest routing delay; the 32-bit Knowles [1, 1] tree and 8-bit Ladner–Fischer tree configuration appears to have both the maximum logic delay and the maximum routing delay.

    Finally, the 32-bit Knowles [1, 2] and 8-bit Han–Carlson configuration seems to be the best compromise between area and speed. Although its occupied area is about 3% larger than the minimum, this is more than compensated for by a significant increase in speed.

    It is also known that FPGAs have built-in carry logic based on fast-carry computation, which outperforms parallel prefix adders in both area and delay [18]. This is mainly because the built-in carry logic can use a high-speed bus to propagate the carry. In our experiments, the logic delay of the built-in carry logic is about 13.7 ns (Fig. 11), longer than that of the parallel prefix adders, which is about 10 ns. However, the routing delay of the built-in carry logic is only 3.4 ns. In contrast, the routing delay of the Brent–Kung adder, which is almost the minimum among all parallel prefix adders, is 11.8 ns. In summary, the total delay of the built-in carry logic is 17.1 ns, less than the 21.9 ns of the Brent–Kung adder. This result confirms that built-in carry logic is a better choice on FPGAs than a parallel prefix adder. However, an EAC adder uses not only the sum signals from the adder but also the group-propagated carries and group-generated carries, which can only be obtained from parallel prefix adders. That is, porting the EAC adder to an FPGA still requires a parallel prefix tree. To achieve a better implementation, experiments over different parallel prefix trees help to find the optimal solution.
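The delay figures quoted above can be cross-checked with a short calculation. The symbols $t_{\text{logic}}$, $t_{\text{route}}$ and $t_{\text{total}}$ are notation introduced here for convenience, and the calculation assumes that the critical path delay is simply the sum of its logic and routing components, as the text describes.

$$
\begin{aligned}
t_{\text{total}} &= t_{\text{logic}} + t_{\text{route}} \\
\text{built-in carry logic:}\quad t_{\text{total}} &= 13.7 + 3.4 = 17.1\ \text{ns} \\
\text{Brent–Kung:}\quad t_{\text{logic}} &= 21.9 - 11.8 = 10.1\ \text{ns} \approx 10\ \text{ns} \\
\frac{t_{\text{total}}^{\text{Brent–Kung}}}{t_{\text{total}}^{\text{built-in}}} &= \frac{21.9}{17.1} \approx 1.28
\end{aligned}
$$

Under this assumption the Brent–Kung adder is about 28% slower overall even though its logic delay is shorter, so the gap is explained by routing.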

    For power consumption, the Xilinx power estimation tool, XPower, gives only very rough estimates. For all implementations of the EAC adder, the power dissipation was estimated at approximately 572 mW. We note that Vitoroulis also did not list the power consumption of regular parallel prefix trees [18]. We will therefore keep looking for better tools that can report precise power dissipation, and will consider power consumption as a metric in future work. For now, based on the simulation results we have, we may say that on FPGA the Kogge–Stone tree is not the better choice that it is in ASIC technology. Compared with the other parallel prefix trees, the Kogge–Stone implementation has longer delay, larger area and similar power consumption.

    6 Conclusion

    In this paper, we proposed a complete design of a binary floating-point EAC adder and explained the details of its arithmetic algorithms. Our EAC adder's algorithms mainly follow a 128-bit binary floating-point adder implemented in the IBM POWER6 microprocessor. Compared with IBM's design, our EAC adder can work independently, which makes it easy to implement and test. Because few details of the EAC adder's arithmetic algorithms exist in today's literature, our paper can help designers to understand this arithmetic unit. We then studied the performance of parallel prefix trees implemented in our EAC adder with FPGA technology. After analysing the relationships between the parallel prefix trees and the other parts of the EAC adder, we modified the implementation of regular parallel prefix trees to ensure that they can be used correctly within the EAC adder. By comparing the area and performance of 14 different parallel prefix tree architecture configurations, we found that the 32-bit Knowles [1, 1] and 8-bit Ladner–Fischer configuration has the minimum area, while the 32-bit Knowles [1, 2] and 8-bit Han–Carlson configuration has the minimum critical path delay. Although its occupied area is about 3% larger than the minimum, the 32-bit Knowles [1, 2] and 8-bit Han–Carlson configuration

    Figure 12 Routing delay (ns)

    Figure 13 Number of slices


    seems to be the best compromise between area and speed for the FPGA implementation.

    7 References

    [1] CURRAN B., MCCREDIE B., SIGAL L., ET AL.: '4GHz+ low-latency fixed-point and binary floating-point execution units for the POWER6 processor'. Digest of 2006 IEEE Int. Solid-State Circuits Conf., 2006, pp. 1728–1734

    [2] SCHWARZ E.M.: 'Binary floating-point unit design', in U.S.S. (ED.): 'High performance energy efficient microprocessor design' (Springer, 2006), pp. 189–208

    [3] YU X.Y., FLEISCHER B., CHAN Y.H., ET AL.: 'A 5 GHz+ 128-bit binary floating-point adder for the POWER6 processor'. Proc. Int. Conf. 32nd European Solid-State Circuits, 2006, pp. 166–169

    [4] LEOBANDUNG D.M.E., NAYAKAMA H., ET AL.: 'High performance 65 nm SOI technology with dual stress liner and low capacitance SRAM cell'. Digest of 2005 Symp. on VLSI Technology, 2005

    [5] KOGGE P.M., STONE H.S.: 'A parallel algorithm for the efficient solution of a general class of recurrence equations', IEEE Trans. Comput., 1973, 22, (8), pp. 786–793

    [6] LADNER R., FISCHER M.: 'Parallel prefix computation', J. ACM, 1980, 27, (4), pp. 831–838

    [7] KNOWLES S.: 'A family of adders'. Proc. 15th IEEE Symp. on Computer Arithmetic, 2001, pp. 277–281

    [8] OKLOBDZIJA V.G., VILLEGER D.: 'Improving multiplier design by using improved column compression tree and optimized final adder in CMOS technology', IEEE Trans. VLSI Syst., 1995, 3, (2)

    [9] STELLING P., OKLOBDZIJA V.G.: 'Design strategies for optimal hybrid final adders in a parallel multiplier', J. VLSI Signal Process. (Special issue on VLSI Arithmetic), 1996, 14, (3)

    [10] ZEYDEL B.R., OKLOBDZIJA V.G., MATHEW S., KRISHNAMURTHY R.K., BORKAR S.: 'A 90 nm 1 GHz 22 mW 16 × 16-bit 2's complement multiplier for wireless baseband'. Proc. 2003 Symp. on VLSI Circuits, 2003

    [11] SHEDLETSKY J.J.: 'Comment on the sequential and indeterminate behavior of an end-around-carry adder', IEEE Trans. Comput., 1977, pp. 271–271

    [12] ZHANG X.Y., CHAN Y.H., MONTOYE R., ET AL.: 'A 270 ps 20 mW 108-bit end-around carry adder for multiply-add fused floating point unit', J. Signal Process. Syst., 2009

    [13] SKLANSKY J.: 'Conditional-sum addition logic', IRE Trans. Electronic Comput., 1960, EC-9, pp. 226–231

    [14] BRENT R.P., KUNG H.T.: 'A regular layout for parallel adders', IEEE Trans. Comput., 1982, C-31, pp. 260–264

    [15] HAN T., CARLSON D.: 'Fast area-efficient VLSI adders'. Proc. Eighth Symp. Comp, 1987, pp. 49–56

    [16] ZIEGLER M.M., STAN M.R.: 'A unified design space for regular parallel prefix adders'. Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE'04), 2004, pp. 1386–1387

    [17] LIU J.H., ZHU Y., ZHU H.K., ET AL.: 'Optimum prefix adders in a comprehensive area, timing and power design space'. Proc. 12th Conf. on Asia South Pacific Design Automation (ASP-DAC'07), 2007, pp. 609–615

    [18] VITOROULIS K., AL-KHALILI A.J.: 'Performance of parallel prefix adders implemented with FPGA technology'. IEEE Northeast Workshop on Circuits and Systems, 2007, pp. 498–501

    [19] KOREN I.: 'Computer arithmetic algorithms' (A.K. Peters, Natick, MA, 2002)
