Published in IET Computers & Digital Techniques
Received on 27th March 2009
Revised on 27th September 2009
doi: 10.1049/iet-cdt.2009.0036
ISSN 1751-8601
Field programmable gate array prototyping
of end-around carry parallel prefix tree architectures
F. Liu1
Q. Tan1
G. Chen2
X. Song3
O. Ait Mohamed4
M. Gu5
1 National Lab of Parallel Distributed Processing, Hunan, China
2 Lingcore Lab, Portland, OR, USA
3 ECE Department, Portland State University, Portland, OR, USA
4 ECE Department, Concordia University, Montreal, Quebec, Canada
5 School of Software, TsingHua University, Beijing, China
E-mail: [email protected]
Abstract: The fused multiply-add unit, an important part of many processors' floating-point units, performs a multiplication followed immediately by an addition. A fast 128-bit floating-point end-around-carry (EAC) adder was proposed for the fused multiply-add unit of the IBM POWER6 microprocessor, but very few algorithmic details about this adder exist in today's literature. In this study, a completely designed EAC adder that can work independently as a regular adder is proposed, and the details of its arithmetic algorithms are described. In IBM's original EAC adder, the Kogge-Stone tree was chosen for its high performance on ASIC technology. In this study, the authors present a comparative study of different parallel prefix trees used in the design of the new EAC adder targeting field programmable gate array (FPGA) technology. The study highlights the main performance differences among 14 different architecture configurations, focusing on the area requirements and the critical path delay. The experimental results show that one architecture configuration combines a lower area requirement with higher performance.
1 Introduction
The fused multiply-add unit plays an important role in modern microprocessors. It performs a floating-point multiplication followed immediately by an addition of the product with a third floating-point operand. In 2007, a seven-cycle fused multiply-add pipeline unit was proposed [1] as a part of the floating-point unit in IBM's POWER6 microprocessor. In this fused multiply-add dataflow, the product must be aligned before it is added to the addend. Because the magnitude of the product is unknown in the early stages prior to the combination with the addend, it is difficult to determine a priori which operand is bigger [2]. Even if it were determined early that the product was bigger, there would be a problem in conditionally complementing two intermediate operands, the carry and sum outputs of the counter tree. Thus, an adder needs to be designed that always outputs a positive magnitude result and preferably needs to conditionally complement only one operand [2]. Therefore a new 128-bit end-around carry (EAC) adder was designed and fabricated in IBM's fused multiply-add unit [3]. The intention was not to produce an adder with the best stand-alone performance but to provide the one with the best overall floating-point performance [3].
IBM implemented its EAC adder in a 65 nm SOI technology [4], and some sub-components are implemented using the Kogge-Stone tree [5]. In fact, the Ladner-Fischer tree [6] was used in IBM's first-pass test chip. Compared to the Ladner-Fischer design, the Kogge-Stone design is about 0.5 FO4 faster with only 6% area overhead and 5% power increase [3]. Therefore the Kogge-Stone tree was chosen in the final design. Besides the Kogge-Stone tree and the Ladner-Fischer tree, many other variations of parallel prefix trees are known [7]. The motivation for our work was to find the best EAC adder for use in a fused multiply-add unit.

IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 4, pp. 306-316
© The Institution of Engineering and Technology 2010, doi: 10.1049/iet-cdt.2009.0036
www.ietdl.org
We also note that field programmable gate array (FPGA) technology has recently enjoyed rapidly increasing popularity. In the nanotechnology era, the logic density of FPGAs has increased dramatically. Because the fixed structure and large variety of resources of an FPGA can significantly affect the implementation results, it is interesting to check whether the EAC adder works well on FPGA technology and to study the performance differences among different architecture configurations, focusing on the area requirements and the critical path delay.
Since it would be difficult to evaluate the full floating-point performance, in this paper we propose a completely designed EAC adder that can work independently without being a part of the fused multiply-add unit. Very few descriptions of the EAC adder's formulations exist in today's literature, therefore the details of the proposed EAC adder's arithmetic algorithms are explained. Because the algorithms of our EAC adder mainly follow the arithmetic algorithms of the IBM EAC adder and can be read without knowledge of the whole floating-point unit, we believe that our description will help readers to better understand the nature of the EAC design.
To make our EAC adder work as a regular adder, we design some new logic units such as the input logic unit and the sign logic unit. This design makes it easier to implement and test other design choices. Moreover, because the additional logic units do not affect the EAC adder's key behaviours, evaluations of the different designs of our EAC adder remain relevant to fused multiply-add unit design. We study the performance of the EAC adder with different parallel prefix trees on FPGA technology. The experimental results show that one architecture configuration combines a lower area requirement with higher performance.
The paper is organised as follows. In Section 2, the related work is reviewed. In Section 3, some preliminaries and the algorithms of the 32-bit adder block are presented. Section 4 describes the architecture of our proposed 128-bit EAC adder and its arithmetic algorithms. Section 5 explains the implementation of different parallel prefix trees in our EAC adder and reports the simulation results. Section 6 concludes this paper.
2 Related work
In the past few years, several adders used in the fused multiply-add operation have been proposed [8-10]. These adder schemes are based on the delay profile of the multiply compression tree. As a result, they are power efficient only when the final addition is performed right after the compression tree and when the EAC computation is not needed [3]. For higher floating-point performance, the EAC adder is used in recent processors. Although the EAC adder has become common hardware design practice, this technique has not been well documented. Shedletsky [11] analysed some behaviours of the EAC adder using examples of real circuits. Yu et al. [3] proposed a fast 128-bit EAC adder which is fabricated as part of the IBM POWER6 microprocessor. They described the adder's architecture and analysed its performance and power dissipation. Zhang et al. [12] presented a 108-bit EAC adder which is also used by a fused multiply-add unit. Structure-aware layout techniques were used to optimise their adder's structure. All the works above focused on the EAC adder's architecture design, while the details of its arithmetic algorithms were not explained. Schwarz [2] discussed some aspects of the EAC adder's algorithms, but some details were still not included.
On the other hand, the parallel prefix tree has recently been used as a subcomponent of the EAC adder. Many classic parallel prefix adders have been proposed, including Sklansky [13], Kogge-Stone and Brent-Kung [14]. These prefix networks achieve three extreme goals: minimal logic levels and wire tracks, minimal max-fanout and logic levels, and minimal wire tracks and max-fanout, respectively. In addition, Ladner-Fischer, Han-Carlson [15] and Knowles [7] implemented the trade-off between each pair of the extreme cases. The structure of the prefix network determines the type of the prefix adder. Ziegler et al. [16] considered sparsity, fanout and radix as three dimensions in the design space of regular parallel prefix adders and presented a unified formalism to describe such structures. Liu et al. [17] studied how to find optimal prefix structures for specific applications and proposed an integer linear programming method to build minimal-power prefix adders within given timing and area constraints. For the EAC adder of the IBM POWER6, chip tests showed that the Kogge-Stone tree was a better choice than the Ladner-Fischer tree. The works discussed above are based on ASIC technology. Vitoroulis et al. [18] investigated the performance of parallel prefix adders implemented with FPGA technology and reported the area requirements and critical path delay for a variety of classical parallel prefix adder structures. However, the parallel prefix trees were implemented as single adders, without being a part of bigger designs. In our work, we try to answer these questions: What are the arithmetic algorithms of an EAC adder with a parallel prefix tree? If we use different parallel prefix trees in the EAC adder on FPGA technology, which one is better? How do parallel prefix trees affect the other parts of the EAC adder? As a part of the EAC adder, should the implementation of the parallel prefix tree itself be changed?
3 Thirty-two-bit adder block in EAC adder
In this paper, the symbols 0 and 1 denote Boolean false and true, or the digital numbers zero and one, respectively; the symbol $\wedge$ denotes the Boolean AND; $\vee$ (or $+$) denotes the Boolean OR; $\oplus$ denotes the Boolean Exclusive OR. A binary number of length $n$ ($n \ge 1$) is an ordered sequence of binary bits where each bit can assume one of the values 0 or 1. For a traditional integer adder, we use $x = (x_{n-1}x_{n-2}\ldots x_1x_0)$ and $y = (y_{n-1}y_{n-2}\ldots y_1y_0)$ to denote the two $n$-bit addends and $s = (s_{n-1}s_{n-2}\ldots s_0)$ to denote the corresponding sum ($n \ge 1$); $x_i$, $y_i$, $s_i$ denote the binary bits of $x$, $y$, $s$ at position $i$, where $0 \le i \le n-1$. Let $c = \{c_n, c_{n-1}, \ldots, c_0\}$ be the corresponding set of carries, where $c_0$ is the initial incoming carry, $c_i$ denotes the carry from bit position $i-1$ and $c_n$ is the outgoing carry.
To explain the adder's algorithm, some standard notions such as propagated carry, generated carry, group-propagated carry and group-generated carry should be introduced. These notions are related to parallel prefix trees and their definitions can be found in Koren [19]. In this paper, we use $P_i = x_i \oplus y_i$ and $G_i = x_i \wedge y_i$ (for simplicity, $G_i = x_iy_i$) to denote the propagated carry and generated carry at bit position $i$, respectively. We use $P_{i:j}$ and $G_{i:j}$ to denote the group-propagated carry and group-generated carry for the bit positions $i, i-1, \ldots, j$, respectively.
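The notions above can be illustrated with a short Python sketch. It is not taken from the paper: the function names (`bit_pg`, `combine`, `group_pg`, `carry`) are hypothetical, and the group signals are folded serially for clarity, whereas a real prefix tree evaluates the same associative operator in parallel.

```python
def bit_pg(x, y, n):
    """Per-bit signals: G_i = x_i AND y_i, P_i = x_i XOR y_i (index 0 = LSB)."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    return g, p


def combine(hi, lo):
    """Prefix operator: merge (G, P) of an upper group with the adjacent lower group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def group_pg(g, p, i, j):
    """Return (G_{i:j}, P_{i:j}) for bit positions i, i-1, ..., j (i >= j)."""
    acc = (g[j], p[j])
    for k in range(j + 1, i + 1):
        acc = combine((g[k], p[k]), acc)
    return acc


def carry(g, p, i, c0=0):
    """c_i = G_{i-1:0} OR (P_{i-1:0} AND c_0), with c_0 the incoming carry."""
    if i == 0:
        return c0
    G, P = group_pg(g, p, i - 1, 0)
    return G | (P & c0)
```

With these helpers, the sum bit at position $i$ is `p[i] ^ carry(g, p, i, c0)` and the outgoing carry of an $n$-bit addition is `carry(g, p, n, c0)`.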
The notion of the carry select adder is also important. For the group that consists of $k$ bit positions starting with bit position $j$ and ending with bit position $i$, where $i = j + k - 1$, the outputs of the carry select adder are the sum bits $s_i, s_{i-1}, \ldots, s_j$ and the outgoing carry $c_{i+1}$. These outputs can be selected by the incoming carry $c_j$ into this group as follows

$$
\begin{aligned}
c_{i+1} &= [c^0_{i+1} \wedge \overline{c_j}] \vee [c^1_{i+1} \wedge c_j] \\
s_m &= [s^0_m \wedge \overline{c_j}] \vee [s^1_m \wedge c_j] \quad (m = j, j+1, \ldots, i)
\end{aligned}
\tag{1}
$$

$\overline{c_j}$ is the Boolean complement of $c_j$; $s^0_m$ is the sum bit at bit position $m$ under the condition that the incoming carry is 0 and $c^0_{i+1}$ is the corresponding outgoing carry; $s^1_m$ and $c^1_{i+1}$ are the sum bit at position $m$ and the outgoing carry under the condition that the incoming carry into the group is 1. Other useful notions and formulations about parallel prefix trees and the carry select adder can be found in Koren [19].
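As a minimal sketch of the selection in (1), the following Python function precomputes both conditional results for one $k$-bit group and lets the incoming carry choose between them. The function name `carry_select_group` and the use of Python integers for the group operands are assumptions made for illustration only.

```python
def carry_select_group(x, y, k, c_j):
    """Carry-select addition of one k-bit group, following Eq. (1).

    x, y : group operands (only the low k bits are used)
    c_j  : incoming carry into the group (0 or 1)
    Returns (sum bits as an integer, outgoing carry c_{i+1}).
    """
    mask = (1 << k) - 1
    t0 = (x & mask) + (y & mask)   # speculative result for incoming carry 0
    t1 = t0 + 1                    # speculative result for incoming carry 1
    s0, c0_out = t0 & mask, t0 >> k
    s1, c1_out = t1 & mask, t1 >> k
    # Multiplexer of Eq. (1): the incoming carry selects one precomputed pair.
    c_out = (c0_out & (1 - c_j)) | (c1_out & c_j)
    s = s1 if c_j else s0
    return s, c_out
```

Because both candidates are ready before `c_j` arrives, only the final multiplexer lies on the carry path, which is the reason the EAC adder keeps conditional sums for every group.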
Since the carry signal is on the critical path, a 128-bit EAC adder was designed in the IBM POWER6 microprocessor to obtain a high-performance floating-point unit. Fig. 1 shows its block diagram.
This adder is divided into three sub-blocks: the 32-bit adder block, the EAC logic block and the final sum selection block [3]. Each 32-bit adder block is also partitioned into three sub-components. The first sub-component is an 8-bit prefix-2 Kogge-Stone tree with sparseness of 2 that generates 8-bit propagates as well as conditional sums that are needed later for sum selection. The second sub-component is a prefix-2 Kogge-Stone tree with sparseness of 8 that generates 32-bit propagated terms as well as 32-bit conditional sums. The last sub-component is a sum selection block [3].
From Fig. 1 we know that the 128-bit EAC adder is composed of four 32-bit adder blocks. The architecture of each 32-bit adder block is shown in Fig. 2. Each 32-bit adder block is actually a carry select adder consisting of four 8-bit adder blocks. Each 8-bit adder block has the structure depicted in Fig. 3.
In each 8-bit adder block, there are two 8-bit adders which are implemented using a parallel prefix tree. In IBM's design, they are implemented using an 8-bit Kogge-Stone tree. The real structure of each 8-bit adder block is a conditional sum adder.
In fact, there are two levels of parallel prefix trees in the 32-bit adder block, as Fig. 2 shows. The first level is the 8-bit parallel prefix tree with sparseness of 2 that generates 8-bit carry signals, propagate terms as well as conditional sums. The second level is the parallel prefix tree with sparseness of 8 that generates 32-bit carry signals, propagate terms and conditional sums. In Fig. 2, we show just one implementation of the parallel prefix tree for the second level.

Figure 1 Block diagram of the 128-bit binary adder [3]
Figure 2 Block diagram of 32-bit adder block
4 Complete EAC adder design
Although the EAC adder has been implemented on several microprocessors, very few details on its formulations and arithmetic algorithms can be found in today's literature. Schwarz [2] gave good explanations of some aspects of the EAC adder's algorithms, but some details were not included. In this section, we try to describe the details of the EAC adder's algorithms clearly. We propose a completely designed EAC adder and describe its architecture. The new architecture makes our EAC adder independent of the fused multiply-add unit. Our new design mainly follows the algorithms of the EAC adder implemented in the IBM POWER6 microprocessor. The additional logic units of our EAC adder ensure that the whole adder can work independently; they do not affect the key algorithms. Therefore we take our EAC design as the example to explain the EAC adder's arithmetic algorithms, which makes our descriptions clearer and easier to read. Readers can understand them without knowing other details of the IBM POWER6's floating-point unit. Another advantage is that our new design is easy to implement and test, which makes it possible to implement different architecture configurations and compare their properties, such as performance.
Fig. 4 shows the architecture of the proposed EAC adder. In this adder, the inputs are two 129-bit binary addends $x = (s\,x_{127}x_{126}\ldots x_0)$ and $y = (s\,y_{127}y_{126}\ldots y_0)$, and the output is the sum $s = (s\,s_{127}s_{126}\ldots s_0)$, where the leading $s$ of each word is its sign bit. They are all in sign-magnitude format. $x.x$, $y.y$ and $s.s$ denote the magnitudes of $x$, $y$ and $s$, and $x.s$, $y.s$ and $s.s$ denote the corresponding sign bits; for the sum, the context indicates whether $s.s$ refers to the magnitude or to the sign bit. The magnitudes of the operands are used to produce the positive magnitude of the sum, and the sign bits of the operands are used to produce the sign of the sum. The adder in Fig. 4 can implement four operations: $x.x + y.y$, $x.x - y.y$, $(-x.x) + y.y$ and $(-x.x) + (-y.y)$.

Figure 3 Eight-bit adder block
Figure 4 Architecture of modified EAC adder

4.1 Integrating addition and subtraction
EAC means that when subtracting two signed numbers that are in sign-magnitude format, the subtraction is implemented by adding the first operand to the Boolean complement of the second operand. For this addition, instead of setting a carry into the least significant digit, the carry out of the most significant digit is taken as the carry into the least significant digit. This ensures that the result of the addition is always a positive magnitude and that only one operand needs to be conditionally complemented. The EAC adder is designed to form a unique sum for every possible pair of addends. When adding, it behaves like other regular adders. When subtracting, it uses the end-around carry to ensure that the result is always positive.
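Hence, with EAC, the adder shown in Fig. 4 should satisfy the following constraints: (1) when $x.s = y.s$, the adder should do addition, and the sum has the sign $s.s = x.s$ and the magnitude $s.s = x.x + y.y$; (2) when $x.s \ne y.s$, the adder should do subtraction. If $x.x \ge y.y$, then the sign is $s.s = x.s$ and the magnitude is $s.s = x.x - y.y$; if $x.x < y.y$, then the sign is $s.s = y.s$ and the magnitude is $s.s = y.y - x.x$.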
Fig. 5 shows the subtraction dataflow of our EAC adder. The algorithm is described as follows:

1. Decide which of $x.x$ and $y.y$ is bigger by performing an effective subtraction $x.x - y.y$. If $x.x - y.y \ge 0$, then $x.x \ge y.y$; otherwise $x.x < y.y$. We use $\overline{y.y}$ to denote the Boolean complement of $y.y$. Since $x.x + \overline{y.y} + 1 = x.x + 2^n - y.y$, we have the property: when $x.x \ge y.y$, the outgoing carry of $x.x + \overline{y.y} + 1$ is 1. Therefore the outgoing carry of $x.x + \overline{y.y} + 1$, denoted by $c_{out}$, can be used to decide whether $x.x$ is bigger than $y.y$. If $c_{out} = 1$, then $x.x \ge y.y$; if $c_{out} = 0$, then $x.x < y.y$.

2. Do the addition $x.x + \overline{y.y} + c_{out}$ and also compute the Boolean complement of the sum, $\overline{x.x + \overline{y.y} + c_{out}}$. When $x.x \ge y.y$, which means $c_{out} = 1$, the subtraction $x.x - y.y$ can be rewritten as the magnitude $s.s = x.x - y.y = x.x + \overline{y.y} + 1 = x.x + \overline{y.y} + c_{out}$. When $x.x < y.y$, which means $c_{out} = 0$, the subtraction $y.y - x.x$ can be rewritten as follows

$$
\begin{aligned}
s.s &= y.y - x.x = -(x.x - y.y) = -(x.x + \overline{y.y} + 1) \\
&= -(x.x + \overline{y.y}) - 1 \\
&= \overline{(x.x + \overline{y.y} + 0)} + 1 - 1 = \overline{(x.x + \overline{y.y} + 0)}
\end{aligned}
\tag{2}
$$

where the arithmetic is modulo $2^n$ and the two's complement identity $-A = \overline{A} + 1$ is used. With the above equation we obtain the following property: when $x.x < y.y$, the output of the EAC adder is defined by

$$
s.s = \overline{x.x + \overline{y.y} + c_{out}} \tag{3}
$$

3. Finally, the outgoing carry $c_{out}$ is used to select the correct magnitude $s.s$. When $x.x \ge y.y$, the output of the EAC adder should be $s.s = x.x + \overline{y.y} + c_{out}$; when $x.x < y.y$, the output of the EAC adder should be $s.s = \overline{x.x + \overline{y.y} + c_{out}}$.

Having discussed how to implement the effective subtraction of the operands $x.x$ and $y.y$, we now focus on their addition. It is easy to implement $x.x + y.y$ by itself. However, we must combine the addition with the subtraction in one single adder. Fig. 6 shows how to integrate them.
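The three-step subtraction dataflow of Fig. 5 can be sketched behaviourally in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `eac_subtract` is hypothetical, the magnitudes are modelled as Python integers, and the two additions are performed one after the other exactly as in the dataflow, without the single-pass optimisation described in Section 4.2.

```python
def eac_subtract(xm, ym, n=128):
    """Return (|xm - ym|, cout) for two n-bit magnitudes xm and ym."""
    mask = (1 << n) - 1
    y_bar = ~ym & mask                   # Boolean complement of y.y
    # Step 1: the outgoing carry of x.x + ~y.y + 1 tells whether x.x >= y.y.
    cout = (xm + y_bar + 1) >> n
    # Step 2: add again, feeding cout back in as the incoming carry.
    total = (xm + y_bar + cout) & mask
    # Step 3: keep the sum when cout = 1, complement it when cout = 0.
    result = total if cout else (~total & mask)
    return result, cout
```

For example, `eac_subtract(5, 9, n=8)` follows the $c_{out} = 0$ branch and `eac_subtract(9, 5, n=8)` the $c_{out} = 1$ branch; both return the magnitude 4.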
In Fig. 6, the add/sub-logic unit takes $x.s$ and $y.s$ as the inputs and $os$ as the output. The output $os$ is defined by

$$
os = x.s \oplus y.s \tag{4}
$$

The input logic unit takes $os$ and $y.y$ as the inputs and $y_t$ as the output. The output $y_t$ is defined by

$$
y_t = \begin{cases} y.y, & os = 0 \\ \overline{y.y}, & os = 1 \end{cases} \tag{5}
$$

The sign logic unit takes $x.s$, $y.s$ and $c_{out}$ as the inputs and the sign $s.s$ as the output. The output $s.s$ is calculated by

$$
s.s = (x.s \wedge c_{out}) \vee (y.s \wedge \overline{c_{out}}) \tag{6}
$$
Figure 5 Subtraction dataflow of EAC adder
Figure 6 Integration of addition and subtraction
In Fig. 6, when $os = 0$, we can use the EAC adder to do the addition $x + y$; when $os = 1$, we use the EAC adder to perform the subtraction as shown in Fig. 5.
The inputs of the EAC adder are $y_t$, $x.x$ and $os$; the outputs are $c_{out}$ and the magnitude $s.s$. When $os = 0$, because $y_t = y.y$, the inputs are actually $x.x$, $y.y$ and the incoming carry 0; the outputs should be the sum $s.s = x.x + y.y$ and the outgoing carry $c_{out}$. When $os = 1$, the inputs are $y_t = \overline{y.y}$, $x.x$ and the incoming carry 1; the outputs should be the correct result computed by the algorithm in Fig. 5. In this way, we perform both the addition and the subtraction using a single adder. We can use another logic unit, named the EAC logic unit, to implement this method.
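The integration of addition and subtraction in Fig. 6 can be sketched as a behavioural Python model of the add/sub-logic unit (4), the input logic unit (5) and the sign logic unit (6). The function name `eac_add_sign_magnitude` and its argument order are assumptions for illustration, and the subtraction branch still uses the two-pass dataflow of Fig. 5.

```python
def eac_add_sign_magnitude(xs, xm, ys, ym, n=128):
    """Add two sign-magnitude numbers; returns (sign, magnitude, cout).

    xs, ys : sign bits (0 or 1); xm, ym : n-bit magnitudes.
    """
    mask = (1 << n) - 1
    os_ = xs ^ ys                          # Eq. (4): 1 means effective subtraction
    yt = (~ym & mask) if os_ else ym       # Eq. (5): input logic unit
    if os_:
        cout = (xm + yt + 1) >> n          # first pass decides which magnitude is larger
        total = (xm + yt + cout) & mask    # second pass with the end-around carry
        mag = total if cout else (~total & mask)
    else:
        cout = (xm + yt) >> n              # plain addition, incoming carry 0
        mag = (xm + yt) & mask
    sign = (xs & cout) | (ys & (1 - cout)) # Eq. (6): sign logic unit
    return sign, mag, cout
```

In the addition case both sign bits are equal, so (6) returns that common sign whatever the value of $c_{out}$; a $c_{out}$ of 1 then simply signals that the magnitude overflowed $n$ bits.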
4.2 EAC logic unit
Fig. 6 shows the way to combine the addition with the subtraction. However, the effective subtraction needs two addition operations, as shown in Fig. 5. With the help of the EAC logic unit, we can implement both the addition and the subtraction with only one addition operation. Fig. 4 shows the design.
In Fig. 4, the input logic unit, sign logic unit and add/sub-logic unit are similar to those in Fig. 6. Each of the four 32-bit adder blocks is the 32-bit adder block shown in Fig. 2. In fact, the real addends of our EAC adder in Fig. 4 are the two 128-bit binary numbers $x.x$ and $y_t$. They are divided into four groups as the inputs into the four 32-bit adder blocks. The four 32-bit adder blocks output the group-propagated carries $P_{127:96}, \ldots, P_{31:0}$ and the group-generated carries $G_{127:96}, \ldots, G_{31:0}$, which are used by the EAC logic unit; they also output the conditional sums $s^0_{127:0}$ and $s^1_{127:0}$, which are used by the last sum selection unit. $s^0_{127:0}$ is the sum under the condition that the incoming carry is 0, while $s^1_{127:0}$ is the sum under the condition that the incoming carry is 1.
In Fig. 4, the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}, P_{31:0}$ are used by both the R logic unit and the EAC logic unit. The R logic unit takes $P_{31:0}$ and $os$ as the inputs and $P^t_{31:0}$ as the output. $P^t_{31:0}$ is computed by

$$
P^t_{31:0} = \begin{cases} P_{31:0}, & os = 1 \\ 0, & os = 0 \end{cases} \tag{7}
$$
The EAC logic unit takes the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}$ together with $P^t_{31:0}$ as the inputs to calculate the incoming carries into each group, $c_0, c_1, c_2, c_3$. With the help of the above logic units, the algorithm of the EAC adder is as follows:

1. When $x.s \ne y.s$, we have $os = 1$, $y_t = \overline{y.y}$ and $P^t_{31:0} = P_{31:0}$. From the formulation of the carry lookahead adder, we can obtain

$$
\begin{aligned}
c'_1 &= G_{31:0} + P_{31:0}c_{in} \\
c'_2 &= G_{63:32} + P_{63:32}G_{31:0} + P_{63:32}P_{31:0}c_{in} \\
c'_3 &= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P_{31:0}c_{in} \\
c'_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0}c_{in}
\end{aligned}
\tag{8}
$$

Following the first step of Fig. 5, we know that $x.x + \overline{y.y} + 1$ should be computed and the outgoing carry $c_{out}$ should be used to decide whether $x.x$ is bigger than $y.y$ or not. Thus, by the above equations, assuming $c_{in} = c_{in1} = 1$, $c_{out}$ can be computed as

$$
\begin{aligned}
c_{out} = c'_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0}
\end{aligned}
\tag{9}
$$

Then, for the second addition, which computes $x.x + \overline{y.y} + c_{out}$, we take $c_{out} = c'_0$ as the incoming carry. Using the formulations of the carry lookahead adder again, we can obtain the group carry signals as

$$
\begin{aligned}
c'_1 &= G_{31:0} + P_{31:0}c_{out} \\
&= G_{31:0} + P_{31:0}\{G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} \\
&\qquad + P_{127:96}P_{95:64}P_{63:32}G_{31:0} + P_{127:96}P_{95:64}P_{63:32}P_{31:0}\} \\
&= G_{31:0} + P_{31:0}G_{127:96} + P_{127:96}P_{31:0}G_{95:64} + P_{127:96}P_{95:64}P_{31:0}G_{63:32} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0} \\
c'_2 &= G_{63:32} + P_{63:32}G_{31:0} + P_{63:32}P_{31:0}c_{out} \\
&= G_{63:32} + P_{63:32}G_{31:0} + P_{63:32}P_{31:0}G_{127:96} + P_{127:96}P_{63:32}P_{31:0}G_{95:64} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0} \\
c'_3 &= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P_{31:0}c_{out} \\
&= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P_{31:0}G_{127:96} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0} \\
c'_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0}c_{out} \\
&= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0}
\end{aligned}
\tag{10}
$$

$c'_0, c'_1, c'_2, c'_3$ can be used to select the correct sum $x.x + \overline{y.y} + c_{out} = sum_{127:0}$. In the following, we will show how the EAC logic unit completes the task mentioned above.
Definition 4.1 (EAC logic unit): The EAC logic unit takes the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}, P^t_{31:0}$ as the inputs and $c_0, c_1, c_2, c_3$ as the outputs. The outputs are defined as follows

$$
\begin{aligned}
c_1 &= G_{31:0} + P^t_{31:0}G_{127:96} + P^t_{31:0}P_{127:96}G_{95:64} + P^t_{31:0}P_{127:96}P_{95:64}G_{63:32} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
c_2 &= G_{63:32} + P_{63:32}G_{31:0} + P^t_{31:0}P_{63:32}G_{127:96} + P^t_{31:0}P_{127:96}P_{63:32}G_{95:64} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
c_3 &= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P^t_{31:0}G_{127:96} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
c_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0}
\end{aligned}
\tag{11}
$$
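As we know, when $x.s \ne y.s$, we have $os = 1$ and $P^t_{31:0} = P_{31:0}$. So, for the EAC logic unit, the equations for calculating $c_0, c_1, c_2, c_3$ can be rewritten as

$$
\begin{aligned}
c_1 &= G_{31:0} + P_{31:0}G_{127:96} + P_{31:0}P_{127:96}G_{95:64} + P_{31:0}P_{127:96}P_{95:64}G_{63:32} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0} \\
c_2 &= G_{63:32} + P_{63:32}G_{31:0} + P_{31:0}P_{63:32}G_{127:96} + P_{31:0}P_{127:96}P_{63:32}G_{95:64} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0} \\
c_3 &= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P_{31:0}G_{127:96} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0} \\
c_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0}
\end{aligned}
\tag{12}
$$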
In this case, it is easy to see that the equations for calculating $c_0, c_1, c_2, c_3$ are equivalent to the formulations for computing $c'_0, c'_1, c'_2, c'_3$ in (10). Therefore the EAC logic unit can be used to implement the subtraction dataflow shown in Fig. 5 with only one addition. Furthermore, $os$ and $c_0$ can be used to select the correct magnitude $s.s$ from $sum_{127:0}$ and $\overline{sum_{127:0}}$ according to the following rules:

When $x.x \ge y.y$, we have $os = 1$, $c_0 = 1$ and $c_{out} = c_0 = 1$. As a result, $sum_{127:0} = x.x + \overline{y.y} + 1$, and the sum is selected as $s.s = sum_{127:0} = x.x + \overline{y.y} + 1$.

When $x.x < y.y$, we have $os = 1$ and $c_0 = c_{out} = 0$, so $sum_{127:0} = x.x + \overline{y.y} + 0$. Then the sum is $s.s = \overline{sum_{127:0}} = \overline{x.x + \overline{y.y} + 0}$.

2. When $x.s = y.s$, we should do the addition $x.x + y.y$. Taking the formulations of the carry lookahead adder and assuming the incoming carry $c_{in} = 0$, the group carries can be calculated as follows

$$
\begin{aligned}
c''_1 &= G_{31:0} + P_{31:0}c_{in} = G_{31:0} \\
c''_2 &= G_{63:32} + P_{63:32}G_{31:0} + P_{63:32}P_{31:0}c_{in} = G_{63:32} + P_{63:32}G_{31:0} \\
c''_3 &= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P_{31:0}c_{in} \\
&= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} \\
c''_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P_{31:0}c_{in} \\
&= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0}
\end{aligned}
\tag{13}
$$
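With $c_{in} = 0$ and $c''_1, c''_2, c''_3$, we can select the correct sum $x.x + y.y$ from the outputs $s^0_{127:0}$ and $s^1_{127:0}$, and $c''_0$ is the outgoing carry.

On the other hand, because $x.s = y.s$, we have $os = 0$, $y_t = y.y$ and $P^t_{31:0} = 0$, so the formulations of the EAC logic unit can be rewritten as follows

$$
\begin{aligned}
c_1 &= G_{31:0} + P^t_{31:0}G_{127:96} + P^t_{31:0}P_{127:96}G_{95:64} + P^t_{31:0}P_{127:96}P_{95:64}G_{63:32} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
&= G_{31:0} + 0 \cdot G_{127:96} + 0 \cdot P_{127:96}G_{95:64} + 0 \cdot P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32} \cdot 0 \\
&= G_{31:0} \\
c_2 &= G_{63:32} + P_{63:32}G_{31:0} + P^t_{31:0}P_{63:32}G_{127:96} + P^t_{31:0}P_{127:96}P_{63:32}G_{95:64} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
&= G_{63:32} + P_{63:32}G_{31:0} \\
c_3 &= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} + P_{95:64}P_{63:32}P^t_{31:0}G_{127:96} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
&= G_{95:64} + P_{95:64}G_{63:32} + P_{95:64}P_{63:32}G_{31:0} \\
c_0 &= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0} \\
&\quad + P_{127:96}P_{95:64}P_{63:32}P^t_{31:0} \\
&= G_{127:96} + P_{127:96}G_{95:64} + P_{127:96}P_{95:64}G_{63:32} + P_{127:96}P_{95:64}P_{63:32}G_{31:0}
\end{aligned}
\tag{14}
$$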
We can see that the equations calculating $c''_0, c''_1, c''_2, c''_3$ in (13) are the same as the equations calculating $c_0, c_1, c_2, c_3$ in (14). Therefore $c_1, c_2, c_3$ can be used to select the correct sum. Here, the first group is a special case: $sum_{31:0}$ is controlled not only by $c_0$ but also by $os$

$$
sum_{31:0} = \begin{cases} s^1_{31:0}, & c_0 \wedge os = 1 \\ s^0_{31:0}, & \text{otherwise} \end{cases} \tag{15}
$$
In this way, when $x.s = y.s$, whatever the value of $c_0$ is, we always have $sum_{127:0} = x.x + y.y$. The EAC logic unit can thus implement the simple addition $x.x + y.y$ correctly. Furthermore, $os$ and $c_0$ can also be used to select the correct sum in the subtraction dataflow discussed above.

Hence the end-around carry logic unit combines the addition and the subtraction correctly by doing only one addition operation. In [2], the formulation of the EAC adder is similar, but some details of the algorithms were not explained, and the EAC logic unit is introduced as a part of the fused multiply-add unit, which means it cannot do the addition independently. Our design given in Fig. 4 can perform the addition independently, so it is easy to verify the correctness of the adder's algorithms.
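The single-pass scheme of this section can be checked with a behavioural Python sketch. It is an illustration under stated assumptions, not the authors' VHDL: the names `group_signals` and `eac_adder` are hypothetical, each 32-bit adder block is modelled by integer arithmetic instead of a prefix tree, and the list indices 0 to 3 stand for the groups 31:0, 63:32, 95:64 and 127:96. The carries follow (11), $P^t_{31:0}$ follows (7), the first group is selected by (15) and the sign by (6).

```python
GROUP = 32
GMASK = (1 << GROUP) - 1
MASK128 = (1 << 128) - 1


def group_signals(a, b):
    """One 32-bit block: (group generate, group propagate, sum for carry-in 0, sum for carry-in 1)."""
    t = a + b
    return t >> GROUP, int((a ^ b) == GMASK), t & GMASK, (t + 1) & GMASK


def eac_adder(xs, xm, ys, ym):
    """Single-pass EAC addition of sign-magnitude operands; returns (sign, magnitude, c0)."""
    os_ = xs ^ ys                                  # Eq. (4)
    yt = (~ym & MASK128) if os_ else ym            # Eq. (5)
    G, P, s0, s1 = [], [], [], []
    for k in range(4):
        a = (xm >> (GROUP * k)) & GMASK
        b = (yt >> (GROUP * k)) & GMASK
        g, p, sum0, sum1 = group_signals(a, b)
        G.append(g)
        P.append(p)
        s0.append(sum0)
        s1.append(sum1)
    Pt = P[0] & os_                                # Eq. (7): R logic unit
    allp = P[3] & P[2] & P[1] & Pt
    # Eq. (11): incoming carries of the groups, computed in one step.
    c1 = G[0] | (Pt & G[3]) | (Pt & P[3] & G[2]) | (Pt & P[3] & P[2] & G[1]) | allp
    c2 = G[1] | (P[1] & G[0]) | (Pt & P[1] & G[3]) | (Pt & P[3] & P[1] & G[2]) | allp
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & Pt & G[3]) | allp
    c0 = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) | (P[3] & P[2] & P[1] & G[0]) | allp
    selects = [c0 & os_, c1, c2, c3]               # Eq. (15) for the first group
    total = 0
    for k in range(4):
        total |= (s1[k] if selects[k] else s0[k]) << (GROUP * k)
    # In a subtraction with c0 = 0 the complemented sum is the magnitude.
    mag = (~total & MASK128) if (os_ and not c0) else total
    sign = (xs & c0) | (ys & (1 - c0))             # Eq. (6) with cout = c0
    return sign, mag, c0
```

Comparing the result with ordinary integer arithmetic for operands that straddle the 32-bit group boundaries is a quick way to see that the one-pass carries of (11) reproduce the two-pass dataflow of Fig. 5.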
5 Implementation and validation
From the arithmetic algorithms discussed above, we know that for the 32-bit adder block in IBM's EAC adder design, both the first-level and the second-level parallel prefix trees are Kogge-Stone trees. Compared to the Ladner-Fischer tree, the Kogge-Stone tree is a better choice on ASIC technology. Here, we try to find the best choice on FPGA technology. In this paper, our proposed EAC adder follows all the key algorithms of IBM's design; the additional logic units are mainly used to ensure that the EAC adder can work independently. Thus, our design is useful not only for implementing and testing the EAC adder easily, but also as a reference for finding a better design for the EAC adder used in a fused multiply-add unit. We implement different parallel prefix tree architecture configurations in our EAC adder and report the simulation results.
Knowles [7] has presented complete classes of regular fan-out prefix adders which are bounded at the extremes by the Kogge-Stone tree and the Ladner-Fischer tree. For our study, using FPGA technology, we choose the regular parallel prefix trees of Knowles's adder family and other basic parallel prefix trees to implement the first-level 8-bit parallel prefix tree as depicted in Fig. 3. The chosen parallel prefix trees are Kogge-Stone; Ladner-Fischer; Brent-Kung; Han-Carlson; Knowles [1, 1, 4]; Knowles [1, 2, 2]; Knowles [1, 1, 2].
Then, for the second-level parallel prefix trees in Fig. 2, we choose Knowles [1, 1] and Knowles [1, 2] from Knowles's adder family to implement them, respectively. These adders were selected because they span the design limits and intermediate cases in terms of area, depth of prefix network, fan-out and interconnect count. The notions introduced in Section 3 are helpful for understanding how these parallel prefix trees work. However, we must change their regular implementation to ensure that they work correctly in the EAC adder.
5.1 Parallel prefix tree design in EAC adder
To implement a parallel prefix tree, we need half adders to calculate the generated carry and propagated carry at each bit position. Then, using these carry signals, we need some other cells to compute the group-generated carries and group-propagated carries. Fig. 7 shows some gate-level basic cells which calculate the group-propagated carry $P_{i:j}$ and the group-generated carry $G_{i:j}$ in the intermediate stages of the parallel prefix tree. In Fig. 7, the quadrate cell calculates $P_{i:j}$ and $G_{i:j}$ simultaneously, whereas the triangular cell calculates only $G_{i:j}$. Therefore the circuit of the quadrate cell is more complex than that of the triangular cell.
With the help of these basic cells, the rough implementation of the Brent-Kung tree is shown in Fig. 8. We use $HA_i$ ($0 \le i \le 7$) to denote the half adders, and we do not take the buffers into account. For a regular parallel prefix adder that adds two addends, we always assume that the incoming carry into the adder is $c_0 = 0$. For two $n$-bit binary addends $x = (x_{n-1}x_{n-2}\ldots x_0)$ and $y = (y_{n-1}y_{n-2}\ldots y_0)$, the carry and sum at bit position $i$ in a parallel prefix tree are computed as $c_i = G_{i-1:0} \vee (P_{i-1:0} \wedge c_0)$ and $s_i = P_i \oplus c_i$, where $0 \le i \le n-1$. Because $c_0 = 0$, we have $c_i = G_{i-1:0} \vee (P_{i-1:0} \wedge c_0) = G_{i-1:0}$. That is why we can use the two different basic cells in Fig. 7 to build the regular Brent-Kung tree in Fig. 8: where only the signal $G_{i-1:0}$ is needed, the simpler triangular cell can be used to reduce the complexity. Vitoroulis [18] compared the performance and area of regular parallel prefix trees implemented on FPGA technology. But when the parallel prefix trees are implemented as components of our EAC adder in Fig. 4, they cannot be designed in the regular way shown in Fig. 8. Both $G_{i:0}$ and $P_{i:0}$ must be kept as outputs for reuse in the next stage. For example, if we want to use the Brent-Kung tree as the component in the EAC adder, which means the parallel prefix tree in Fig. 3 is implemented using the Brent-Kung tree, we can only use the quadrate cell to calculate the signals in the intermediate stages. We must therefore change the regular design of the Brent-Kung tree shown in Fig. 8. Fig. 9 shows the rough architecture of the modified Brent-Kung tree adopted.

Figure 7 Basic cells in parallel prefix tree
Figure 8 Eight-bit Brent-Kung tree
Therefore, on FPGA technology, properties of the different parallel prefix trees such as area and performance will differ from the results listed in Vitoroulis's report [18]. As a result, to implement different parallel prefix trees in our EAC adder, we should first change the implementation of the parallel prefix tree itself, and then also take into account the relationship between the parallel prefix tree and the other parts of the EAC adder.
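To illustrate why both $G_{i:0}$ and $P_{i:0}$ are kept, the sketch below models an eight-bit Brent–Kung prefix tree built only from cells that produce both outputs, then applies $c_i = G_{i-1:0} \lor (P_{i-1:0} \land c_0)$ with a carry-in that is not assumed to be zero. It is a software model under stated assumptions, not the authors' VHDL: the function names are hypothetical, the `c0` parameter merely stands in for a carry supplied by a later stage, and the width is assumed to be a power of two.

```python
def combine(g_hi, p_hi, g_lo, p_lo):
    """Quadrate-cell operator: both group signals are produced."""
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def brent_kung_prefix(g, p):
    """Return lists G, P with G[i] = G_{i:0} and P[i] = P_{i:0}; len(g) is a power of two."""
    n = len(g)
    G, P = list(g), list(p)
    d = 1
    while d < n:                          # up-sweep: spans double at every level
        for i in range(2 * d - 1, n, 2 * d):
            G[i], P[i] = combine(G[i], P[i], G[i - d], P[i - d])
        d *= 2
    d = n // 4
    while d >= 1:                         # down-sweep: fill in the remaining prefixes
        for i in range(3 * d - 1, n, 2 * d):
            G[i], P[i] = combine(G[i], P[i], G[i - d], P[i - d])
        d //= 2
    return G, P


def prefix_add(x, y, n=8, c0=0):
    """Add two n-bit values; return (sum modulo 2**n, carry out)."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    G, P = brent_kung_prefix(g, p)
    carries = [c0] + [G[i - 1] | (P[i - 1] & c0) for i in range(1, n)]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, G[n - 1] | (P[n - 1] & c0)


print(prefix_add(200, 100))        # c0 = 0: only the G signals influence the result
print(prefix_add(255, 0, c0=1))    # c0 = 1: the carry ripples through via P_{i:0}
```

With `c0 = 0` the `P[i - 1] & c0` term vanishes, which corresponds to the regular tree where triangular cells suffice. With a non-zero carry-in the group-propagated signals are required, so every intermediate cell has to be a quadrate cell.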
5.2 Experimental results
In this section, we present the simulation, synthesis and implementation results. EAC adders were designed using the various tree structures discussed in this paper. They were first coded in VHDL at two different levels, and all 14 architecture configurations were then modelled in the Aldec Active-HDL simulation environment. The adder functionality was successfully verified using 100 000 random test vectors.
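Random-vector verification of this kind can be sketched as below. The sketch checks a plain binary adder model against Python integer addition; it does not reproduce the authors' Active-HDL testbench or the EAC adder itself. The design under test (`ripple_carry_add`), the harness name, the 16-bit width and the seed are all assumptions made for illustration.

```python
import random


def ripple_carry_add(x, y, n_bits=16):
    """Hypothetical design under test: bit-serial ripple-carry adder, result modulo 2**n_bits."""
    carry, total = 0, 0
    for i in range(n_bits):
        a, b = (x >> i) & 1, (y >> i) & 1
        total |= (a ^ b ^ carry) << i
        carry = (a & b) | (carry & (a ^ b))
    return total


def find_mismatch(adder, n_bits=16, n_vectors=100_000, seed=2009):
    """Return the first failing (x, y, got, expected) tuple, or None if all vectors pass."""
    rng = random.Random(seed)          # fixed seed keeps a failure reproducible
    mask = (1 << n_bits) - 1
    for _ in range(n_vectors):
        x, y = rng.getrandbits(n_bits), rng.getrandbits(n_bits)
        got, expected = adder(x, y, n_bits), (x + y) & mask
        if got != expected:
            return x, y, got, expected
    return None


if __name__ == "__main__":
    print(find_mismatch(ripple_carry_add))
```

Returning the first failing vector rather than a bare pass/fail flag makes it easier to replay the failure in a simulator.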
After functional verification, all 14 adder architectures were implemented on a high-performance Xilinx Virtex-II Pro FPGA (XC2VP100) in the Xilinx ISE synthesis environment. We measured the area of an implemented design as the number of FPGA slices it occupies, and the speed as the delay of the longest signal path, or critical path, of the design (ns). The area and speed results are compared in Figs. 10–13.
These results show that the minimum area is achieved with the 32-bit Knowles [1, 1] tree and 8-bit Ladner–Fischer tree configuration, and the maximum area, which is 18% larger, with the 32-bit Knowles [1, 1] tree and 8-bit Knowles [1, 1, 2] tree configuration (Fig. 13).
By comparing the critical path delays of the various EAC adders in Fig. 10, we find that the 32-bit Knowles [1, 2] tree and 8-bit Han–Carlson tree configuration has the lowest delay, while the 32-bit Knowles [1, 1] tree and 8-bit Ladner–Fischer tree configuration has the maximum delay, which is about 22.5% larger. The critical path delay has two main components, the logic delay and the routing delay. Fig. 10 shows that the routing delay exceeds the logic delay for all adders, with very little variation. The wiring (routing) is chosen automatically by the synthesis tools. It can sometimes be optimised in the final phase of a design, manually or by other methods, but this optimisation is sometimes very hard. Although the logic delay is always related to the routing delay, we still compare them separately in Figs. 11 and 12. The results show that the 32-bit Knowles [1, 2] tree and 8-bit Han–Carlson tree configuration also has the lowest logic delay, but not the lowest routing delay; the 32-bit Knowles [1, 1] tree and 8-bit Ladner–Fischer tree configuration seems to have both the maximum logic delay and the maximum routing delay.
Figure 9 Eight-bit Brent–Kung tree in EAC adder
Figure 10 Critical path delay, logic delay and route delay (ns)
Figure 11 Logic delay (ns)
Finally, the 32-bit Knowles [1, 2] and 8-bit Han–Carlson configuration seems to be the best compromise between area and speed. Although its occupied area is about 3% larger than the minimum, this is more than compensated by a significant increase in speed.
It is also known that FPGAs have built-in carry logic based on fast-carry computation, which outperforms parallel prefix adders in both area and delay [18]. This is mainly because the built-in carry logic can use a high-speed bus to propagate the carry. In our experiments, the logic delay of the built-in carry logic is about 13.7 ns (Fig. 11), longer than that of the parallel prefix adders, which is about 10 ns. However, the routing delay of the built-in carry logic is only 3.4 ns. In contrast, the routing delay of the Brent–Kung adder, which is almost the minimum among all parallel prefix adders, is 11.8 ns. In total, the delay of the built-in carry logic is 17.1 ns, less than the 21.9 ns of the Brent–Kung adder. This result confirms that built-in carry logic is a better choice on FPGAs than a parallel prefix adder for regular addition. However, in an EAC adder we need not only the sum signals but also the group-propagated and group-generated carries, which can only be obtained from parallel prefix adders. That is, porting the EAC adder to an FPGA still requires a parallel prefix tree. Experiments over different parallel prefix trees therefore help to find the best implementation.
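The delay figures quoted in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the helper name `routing_share` is hypothetical, and the Brent–Kung logic delay is derived here as total minus routing rather than read from Fig. 11.

```python
def routing_share(logic_ns, routing_ns):
    """Return (total delay in ns, fraction of that delay spent in routing)."""
    total = logic_ns + routing_ns
    return total, routing_ns / total


# Built-in carry logic: 13.7 ns logic + 3.4 ns routing.
builtin_total, builtin_share = routing_share(13.7, 3.4)

# Brent-Kung adder: 21.9 ns total, of which 11.8 ns is routing.
bk_logic = 21.9 - 11.8
bk_total, bk_share = routing_share(bk_logic, 11.8)

print(f"built-in carry: total {builtin_total:.1f} ns, routing share {builtin_share:.0%}")
print(f"Brent-Kung:     total {bk_total:.1f} ns, routing share {bk_share:.0%}")
```

The totals reproduce the 17.1 ns and 21.9 ns given in the text, and the derived Brent–Kung logic delay of about 10.1 ns agrees with the "about 10 ns" quoted for parallel prefix adders. The comparison shows that the advantage of the built-in carry logic comes from routing, not from logic depth.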
For power consumption, the Xilinx power estimation tool, XPower, gives only very rough estimates. For all implementations of the EAC adder, the power dissipation was estimated at approximately 572 mW. We note that Vitoroulis also did not list the power consumption of regular parallel prefix trees [18]. We will therefore keep looking for tools that report power dissipation more precisely, and we will consider power consumption as a metric in future work. For now, based on the simulation results we have, we can say that the Kogge–Stone tree is not as good a choice on FPGA as it is in ASIC technology. Compared with the other parallel prefix trees, the Kogge–Stone implementation has longer delay, larger area and similar power consumption.
6 Conclusion
In this paper, we proposed a complete design of a binary floating-point EAC adder and explained the details of its arithmetic algorithms. The algorithms of our EAC adder mainly follow the 128-bit binary floating-point adder implemented in the IBM POWER6 microprocessor. Compared with IBM's design, our EAC adder can work independently, which makes it easy to implement and test. Because few details of the EAC adder's arithmetic algorithms exist in today's literature, our paper can help designers to understand this arithmetic unit. We then studied the performance of parallel prefix trees implemented in our EAC adder on FPGA technology. After analysing the relationships between the parallel prefix trees and the other parts of the EAC adder, we modified the implementation of regular parallel prefix trees so that they can be used correctly within the EAC adder. By comparing the area and performance of 14 different parallel prefix tree architecture configurations, we found that the 32-bit Knowles [1, 1] and 8-bit Ladner–Fischer configuration has the minimum area, while the 32-bit Knowles [1, 2] and 8-bit Han–Carlson configuration has the minimum critical path delay. Although its occupied area is about 3% larger than the minimum, the 32-bit Knowles [1, 2] and 8-bit Han–Carlson configuration seems to be the best compromise between area and speed for the FPGA implementation.
Figure 12 Routing delay (ns)
Figure 13 Number of slices
7 References
[1] CURRAN B., MCCREDIE B., SIGAL L., ET AL.: '4GHz+ low-latency fixed-point and binary floating-point execution units for the POWER6 processor'. Digest of 2006 IEEE Int. Solid-State Circuits Conf., 2006, pp. 1728–1734
[2] SCHWARZ E.M.: 'Binary floating-point unit design', in U.S.S. (ED.): 'High performance energy efficient microprocessor design' (Springer, 2006), pp. 189–208
[3] YU X.Y., FLEISCHER B., CHAN Y.H., ET AL.: 'A 5 GHz+ 128-bit binary floating-point adder for the POWER6 processor'. Proc. Int. Conf. 32nd European Solid-State Circuits, 2006, pp. 166–169
[4] LEOBANDUNG D.M.E., NAYAKAMA H., ET AL.: 'High performance 65 nm SOI technology with dual stress liner and low capacitance SRAM cell'. Digest of 2005 Symp. on VLSI Technology, 2005
[5] KOGGE P.M., STONE H.S.: 'A parallel algorithm for the efficient solution of a general class of recurrence equations', IEEE Trans. Comput., 1973, 22, (8), pp. 786–793
[6] LADNER R., FISCHER M.: 'Parallel prefix computation', J. ACM, 1980, 27, (4), pp. 831–838
[7] KNOWLES S.: 'A family of adders'. Proc. 15th IEEE Symp. on Computer Arithmetic, 2001, pp. 277–281
[8] OKLOBDZIJA V.G., VILLEGER D.: 'Improving multiplier design by using improved column compression tree and optimized final adder in CMOS technology', IEEE Trans. VLSI Syst., 1995, 3, (2)
[9] STELLING P., OKLOBDZIJA V.G.: 'Design strategies for optimal hybrid final adders in a parallel multiplier', J. VLSI Signal Process. (Special issue on VLSI Arithmetic), 1996, 14, (3)
[10] ZEYDEL B.R., OKLOBDZIJA V.G., MATHEW S., KRISHNAMURTHY R.K., BORKAR S.: 'A 90 nm 1 GHz 22 mW 16 × 16-bit 2's complement multiplier for wireless baseband'. Proc. 2003 Symp. on VLSI Circuits, 2003
[11] SHEDLETSKY J.J.: 'Comment on the sequential and indeterminate behavior of an end-around-carry adder', IEEE Trans. Comput., 1977, pp. 271–271
[12] ZHANG X.Y., CHAN Y.H., MONTOYE R., ET AL.: 'A 270 ps 20 mW 108-bit end-around carry adder for multiply-add fused floating point unit', J. Signal Process. Syst., 2009
[13] SKLANSKY J.: 'Conditional-sum addition logic', IRE Trans. Electronic Comput., 1960, EC-9, pp. 226–231
[14] BRENT R.P., KUNG H.T.: 'A regular layout for parallel adders', IEEE Trans. Comput., 1982, C, (31), pp. 260–264
[15] HAN T., CARLSON D.: 'Fast area-efficient VLSI adders'. Proc. Eighth Symp. Comp, 1987, pp. 49–56
[16] ZIEGLER M.M., STAN M.R.: 'A unified design space for regular parallel prefix adders'. Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE'04), 2004, pp. 1386–1387
[17] LIU J.H., ZHU Y., ZHU H.K., ET AL.: 'Optimum prefix adders in a comprehensive area, timing and power design space'. Proc. 12th Conf. on Asia South Pacific Design Automation (ASP-DAC'07), 2007, pp. 609–615
[18] VITOROULIS K., AL-KHALILI A.J.: 'Performance of parallel prefix adders implemented with FPGA technology'. IEEE Northeast Workshop on Circuits and Systems, 2007, pp. 498–501
[19] KOREN I.: 'Computer arithmetic algorithms' (A.K. Peters, Natick, MA, 2002)