
Decimal Floating-Point Multiplication



Mark A. Erle, Senior Member, IEEE, Brian J. Hickmann, Member, IEEE, and

Michael J. Schulte, Senior Member, IEEE

Abstract—Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. This paper presents the design of two decimal floating-point multipliers: one whose partial product accumulation strategy employs decimal carry-save addition and one that employs binary carry-save addition. The multiplier based on decimal carry-save addition favors a nonpipelined iterative implementation. The multiplier utilizing binary carry-save addition allows for an efficient pipelined implementation when latency and throughput are considered more important than area. Both designs comply with specifications for decimal multiplication given in the IEEE 754 Standard for Floating-Point Arithmetic (IEEE 754-2008). The multipliers extend previously published decimal fixed-point multipliers by adding several features, including exponent generation, sticky bit generation, shifting of the intermediate product, rounding, and exception detection and handling. Novel features of the multipliers include support for decimal floating-point numbers, on-the-fly generation of the sticky bit in the iterative design, early estimation of the shift amount, and efficient decimal rounding. Iterative and parallel decimal fixed-point and floating-point multipliers are compared in terms of their area, delay, latency, and throughput based on verified Verilog register-transfer-level models.

Index Terms—Decimal multiplication, binary coded decimal, floating-point arithmetic, serial multiplication, parallel multiplication, pipelined multiplication.


1 INTRODUCTION

Due to the importance of decimal arithmetic in commercial applications and the potential speedup achievable [1], [2], microprocessors supporting decimal floating-point (DFP) arithmetic are now available [3]. Further, specifications for decimal arithmetic have been added to the updated IEEE Standard for Floating-Point Arithmetic [4], hereafter referred to as "IEEE 754-2008." These specifications are more comprehensive than those detailed in the IEEE Standard for Radix-Independent Floating-Point Arithmetic [5]. They include formats for single, double, and quadruple precision DFP numbers and operations for double and quadruple precision DFP numbers.

A fundamental operation in DFP arithmetic is multiplication, which is integral to the decimal-dominant applications found in financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. In the research presented by Wang et al. [2] on six benchmarks running on platforms with software implementations of decimal arithmetic, the DFP multiply operation consumed from 3 percent to 33 percent of the execution time and required an average latency per function call of 634 cycles. Thus, there is opportunity for significant speedup in software applications running on platforms with hardware implementations of DFP multiplication.

Over the years, several designs for fixed-point decimal multiplication have been proposed, including [6], [7], [8], and [9]. These designs iterate over the digits of the multiplier and, based on the value of the current digit, successively add either the multiplicand or a multiple of the multiplicand. The multiples are generated via costly lookup tables or developed using a previously generated subset of multiples. All of these designs are iterative, and none of them support DFP multiplication.

Recently, two papers have been published describing parallel fixed-point decimal multiplication. Lang and Nannarelli's parallel design [10] recodes each multiplier operand digit into two terms, drawn from {0, 5, 10} and {−2, ..., +2}. In so doing, only the decimal double and quintuple of the multiplicand are required, each of which is obtained quickly as there is no carry propagation [11]. This recoding scheme leads to two partial products for every digit of the multiplier operand. The set of partial products is reduced to two via binary-coded decimal (BCD) 3:2 carry-save adders (CSAs). In Vazquez et al.'s work [12], a family of parallel decimal multipliers is presented based on the recoding of the multiplicand digits into BCD-4221.¹ This recoding enables the use of binary CSAs for the partial product accumulation by correcting the carry digit with decimal doubling. Since binary CSAs are used in the partial product accumulation step, the tree can also support binary multiplication.

Aside from recently published papers by the authors [13], [14], there are only two previous papers presenting designs for DFP multiplication [15], [16]. The multiplier

902 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 7, JULY 2009

. M.A. Erle is with IBM, 6677 Sauterne Drive, Macungie, PA 18062. E-mail: [email protected].

. B.J. Hickmann is with Intel Corporation-Ronler Acres, 2501 NW 229th Ave. Mailstop RA2-406, Hillsboro, OR 97124. E-mail: [email protected].

. M.J. Schulte is with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engineering Drive, Madison, WI 53706. E-mail: [email protected].

Manuscript received 23 July 2007; revised 12 Mar. 2008; accepted 18 Sept. 2008; published online 5 Dec. 2008. Recommended for acceptance by P. Kornerup, P. Montuschi, J.-M. Muller, and E. Schwarz. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-2007-07-0372. Digital Object Identifier no. 10.1109/TC.2008.218.

1. In this paper, alternate decimal codes are presented as BCD-xxxx, where x is the weight of each respective binary bit. For example, 1001₂ equates to 4 + 0 + 0 + 1 = 5 in BCD-4221 and 8 + 0 + 0 + 1 = 9 in BCD-8421.
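The footnote's weighted decoding can be sketched in a few lines of Python (the function name `decode_bcd` is ours, purely for illustration):

```python
def decode_bcd(bits, weights):
    """Decode a 4-bit decimal digit code under the given per-bit weights.

    bits is a string such as "1001", most significant bit first; weights is
    a tuple such as (8, 4, 2, 1) for BCD-8421 or (4, 2, 2, 1) for BCD-4221.
    """
    return sum(w for b, w in zip(bits, weights) if b == "1")

# The footnote's example: 1001 decodes differently under the two codes.
assert decode_bcd("1001", (4, 2, 2, 1)) == 5   # BCD-4221
assert decode_bcd("1001", (8, 4, 2, 1)) == 9   # BCD-8421
```

Note that BCD-4221 is a redundant code: `decode_bcd("0111", (4, 2, 2, 1))` also yields 5.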

0018-9340/09/$25.00 © 2009 IEEE Published by the IEEE Computer Society

Authorized licensed use limited to: MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY. Downloaded on July 29,2010 at 04:42:48 UTC from IEEE Xplore. Restrictions apply.


designs by Cohen et al. [15] and Bolenger et al. [16] are digit serial and have long latencies. Furthermore, the results they produce do not comply with IEEE 754-2008.

This paper presents the designs of an iterative and a parallel DFP multiplier in compliance with IEEE 754-2008 and a prevailing decimal arithmetic specification [17]. Our iterative DFP multiplier uses the iterative fixed-point decimal multiplier presented in [9] for significand multiplication. It features a reduced set of multiplicand multiples [11], decimal CSAs for the iterative accumulation of partial products [6], [8], and a variation of direct decimal addition [18] to implement decimal 4:2 compression. Our parallel DFP multiplier uses the radix-10 parallel fixed-point decimal multiplier proposed by Vazquez et al. [12] to perform significand multiplication. The radix-10 parallel multiplier design, introduced in [12], features a reduced set of multiplicand multiples, binary carry-save addition for the accumulation of partial products, and a recoding of the multiplier operand to reduce the number of partial products. Novel aspects of our proposed DFP multipliers include support for DFP arithmetic, early estimation of the shift amount, and efficient decimal rounding based on the observation that the designs do not exhibit rounding overflow. Additionally, the iterative design utilizes on-the-fly generation of the sticky bit to reduce area and delay.

This paper summarizes and extends the research published in [9], [13], and [14]. In particular, the paper presents additional references, more detailed descriptions of previous research, qualitative and quantitative comparisons of the iterative and parallel DFP multipliers and the fixed-point significand multipliers they employ, and a proof that rounding overflow cannot occur in our DFP multiplier designs. It also provides additional information on the IEEE 754-2008 formats and the encodings used for IEEE 754-2008 DFP numbers. The outline is given as follows: In Section 2, background information on DFP multiplication is presented. Next, Section 3 contains a flowchart of the iterative multiplier's algorithm and descriptions of its major design components. A flowchart of the parallel multiplier's algorithm, descriptions of its unique components and functions, and an implementation are presented in Section 4. Results from synthesizing both DFP designs and their fixed-point counterparts are detailed in Section 5, along with a discussion of the trade-offs between the designs. In Section 6, a summary of the paper is presented.

Throughout this paper, uppercase variables denote multiple-digit words, lowercase variables with subscripts denote digits, and lowercase variables with subscripts and indices denote bits. Thus, a_i corresponds to digit i of operand A, and a_i[j] corresponds to bit j in digit i. Square brackets are not needed when the lowercase variable represents a binary number. Uppercase variables with subscripts denote multiple-digit words that are part of an iterative equation, and uppercase variables with subscripts and indices denote digits. Superscripts are used to differentiate various forms of the same variable. Bits and digits are indexed from most significant to least significant, starting with index zero. A subscript next to a constant or string indicates the base. As for terminology, the precision of a number is the maximum number of digits able to be represented in a given fixed-width format; the number of significant digits is the number of digits from the most significant nonzero digit to the least significant digit (LSD), inclusive; and the number of essential digits is the number of digits between the most significant nonzero digit and the least significant nonzero digit, inclusive. Finally, the symbols ∨ and ∧ are used for logical OR and AND, respectively. For concatenation, variables are comma separated and enclosed in braces, e.g., {A, B}. Depending on the context, this grouping is also used to indicate a set.

2 BACKGROUND OF DFP MULTIPLICATION

To maintain consistency between DFP and binary floating-point (BFP) formats, IEEE 754-2008 defines the parameters for various format widths for both DFP and BFP numbers in a slightly modified scientific notation form. The modification is the radix point appearing to the right of the most significant digit (MSD) or most significant bit (MSB), respectively, thereby restricting the magnitude of the significands to less than their radix. However, it is equivalent and more in line with elementary arithmetic to express DFP numbers in the following manner:

D = (−1)^s × C × 10^(E − bias),  (1)

where s is the sign bit, C is the nonnegative integer significand, and E is the biased nonnegative integer exponent [4]. Thus, the biased exponent variable E used in this paper relates to IEEE 754-2008's e in the following way: E = e − (p − 1) + bias.
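As a quick sanity check of (1), the value of a DFP operand can be evaluated with Python's exact `decimal` module (a sketch; the bias of 398 is the standard decimal64 value for the integer-significand view, and `dfp_value` is an illustrative name):

```python
from decimal import Decimal

BIAS = 398  # decimal64 exponent bias for the integer-significand view

def dfp_value(s, C, E, bias=BIAS):
    """Evaluate D = (-1)^s * C * 10^(E - bias), per (1)."""
    return Decimal((-1) ** s) * Decimal(C).scaleb(E - bias)

# C is the nonnegative integer significand; E is the biased exponent.
assert dfp_value(0, 12345, BIAS) == Decimal(12345)        # 10^0 scaling
assert dfp_value(1, 12345, BIAS - 3) == Decimal("-12.345")
```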

Three fixed-width interchange formats for DFP numbers are specified in IEEE 754-2008: decimal32, decimal64, and decimal128. For each format width, there is a sign field (s), a combination field (G), and a trailing significand field (T). The number of bits in each format dedicated to s, G, and T are provided in Table 1, along with each format's IEEE 754-2008 parameters.

Both the combination and trailing significand fields are encoded to maximize the number of representable values and diagnostic information. The combination field is encoded to indicate if the representation is a finite number, an infinite number, or a nonnumber (i.e., Not a Number or NaN). It also contains the exponent and the MSD of the


TABLE 1
DFP Format Parameters

Note: Length values are in bits, except where otherwise indicated.



significand when the operand represents a finite number. The combination field is w + 5 bits, and the biased exponent is w + 2 bits, where w is 6, 8, and 12 for the three interchange formats. The trailing significand field is a multiple of 10 bits and is either encoded via the Densely Packed Decimal (DPD) algorithm [19] or as an unsigned binary integer [4]. When encoded via DPD, each 10-bit substring represents three decimal digits. The DPD decoding and encoding schemes are found in IEEE 754-2008.

The nonnormalized significand leads to some distinct differences between DFP and BFP multiplication. First, if the number of significant digits in the unrounded product exceeds the format's precision p, then this intermediate product may need to be left shifted prior to rounding. Second, if an intermediate product contains p − i essential digits, then there exist i equivalent representations of the value, subject to the available exponent range. For example, if p equals 5 and the operation is 32 × 10^15 multiplied by 70 × 10^15, then possible results are 22,400 × 10^29, 2,240 × 10^30, or 224 × 10^31.
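The three candidate results in the example are numerically identical, as a direct check with exact integers confirms:

```python
# p = 5: the product 32 x 10^15 times 70 x 10^15 admits several
# (significand, exponent) pairs that denote the same value.
product = (32 * 10**15) * (70 * 10**15)
assert product == 22400 * 10**29
assert product == 2240 * 10**30
assert product == 224 * 10**31
```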

Because of the possibility of multiple representations of the same value, IEEE 754-2008 introduces the concept of a preferred exponent. The preferred exponent PE, drawn from elementary arithmetic, is based on the operation and the exponent(s) of the operand(s). For multiplication, the preferred exponent, prior to any rounding or exceptions, is

PE = E_A + E_B − bias.  (2)

For example, the product of A = 320 × 10^−2 (E_A = −2 + 101) multiplied by B = 70 × 10^−2 (E_B = −2 + 101) is P = 22,400 × 10^−4 (PE = −4 + 101).
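Equation (2) and this example can be replayed directly (the bias of 101 matches the example; `preferred_exponent` is an illustrative name):

```python
def preferred_exponent(EA, EB, bias):
    """PE = EA + EB - bias, per (2)."""
    return EA + EB - bias

bias = 101
EA = -2 + bias   # A = 320 x 10^-2
EB = -2 + bias   # B = 70 x 10^-2
PE = preferred_exponent(EA, EB, bias)
assert PE == -4 + bias          # P = 22,400 x 10^-4
assert 320 * 70 == 22400        # the product significand
```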

The DFP multipliers presented in this paper assume that the operands are stored in the decimal64 format with the DPD encoding. The decimal64 format is comprised of a 1-bit sign, a 13-bit combination field, and a 50-bit trailing significand field, which, after decoding, yields a 10-bit binary exponent and a 16-digit decimal significand. The choice of precision, exponent base and range, and significand representation and encoding is examined in [20] and, to a lesser extent, in [4].

3 ITERATIVE DFP MULTIPLIER

The iterative DFP multiplier design presented in this paper is an extension of a decimal fixed-point multiplier design published in [9]. A flowchart-style drawing of the iterative DFP multiplication algorithm is shown in Fig. 1, with the steps of the decimal fixed-point multiplier from [9] surrounded by a gray box. The operation begins with the reading of the operands from a register file or memory and decoding from their DPD encoding. Next, the double, quadruple, and quintuple of the multiplicand are generated in the data path portion of the design. Then, in an iterative manner starting with the LSD of the multiplier significand, each digit is used to select two multiples to add together to yield a respective partial product. The partial product is then added with the previous iteration's accumulated product, and the accumulated product is shifted one digit to the right. After all the multiplier digits have been processed, the sum of all the partial products, called the intermediate product, is stored in a register twice as long as each input significand. All the additions in the iterative portion of the algorithm yield intermediate results in decimal carry-save form.

In parallel with the generation of the multiples and the accumulation of partial products, the significands are examined to determine their leading zero counts LZ_A and LZ_B, the exponents are examined to determine the intermediate exponents IE_IP and IE_SIP, and the signs are XORed to determine the product sign s_P. Based on the leading zero counts and the intermediate exponent, two vital control values are generated: a shift-left amount SLA and a sticky counter SC. The shift-left amount is needed to properly align the intermediate product IP prior to rounding. The sticky counter is needed to generate the sticky bit sb created on the fly during the accumulation of partial products. The intermediate product is then shifted based on the shift-left amount to produce the shifted intermediate product SIP. Using the operands' combination fields, the intermediate exponent, and information from the shift and round steps, a determination is made as to whether an exception needs to be signaled and corrective action taken.


Fig. 1. Flowchart of the iterative DFP multiplier design using decimal CSAs [13].



At the end of the iterative accumulation of partial products, the intermediate product is in the 2p-digit intermediate product register. Ultimately, a p-digit rounded product needs to be delivered. Since the shifted intermediate product is in carry-save form, an add step is necessary to produce a nonredundant product. After rounding the shifted intermediate product, the product exponent E_P, final product significand C_P, and product combination field are generated.² Finally, these values, along with the product sign, are used to produce the final result that is DPD encoded and written to a register file or memory.

Fig. 2 depicts the top portion of the DFP multiplier design. It is provided to aid in visualizing the manner in which partial products enter the intermediate product register for accumulation, the location of the decimal point in the data path, and the generation of the sticky bit. With the exception of the sticky bit generation shown in the magnified inset, this top portion of the multiplier design is the decimal fixed-point multiplier design described in [9].

A critical design choice is the location of the decimal point in the data path, as this dictates the direction the intermediate product may need to be shifted and the location and implementation of the rounding logic. For this design, the location is chosen to be exactly in the middle of the intermediate product register throughout the data path, splitting the intermediate product into a more significant truncated product TIP and a fractional product FRP. The chosen location of the decimal point allows the intermediate product to require only a left shift to produce a rounded product (except during gradual underflow, if supported).

In the remainder of this section, the primary components necessary to perform DFP multiplication are described in detail.

3.1 Partial Product Generation, Selection, and Accumulation

In this design, a reduced set of multiplicand multiples is generated for use in creating the partial products. This function takes place in the "Secondary Multiple Generation" block in Fig. 2. The set {1CA, 2CA, 4CA, 5CA}, where CA represents the multiplicand significand, is chosen as the


2. In actuality, the leading digit of C_P and the leading two bits of E_P are contained in the combination field.

Fig. 2. Top portion of the iterative DFP multiplier design.



multiples can be generated without carry propagation and all the remaining multiplicand multiples can be generated by adding only two from the set [9]. Neither the double nor the quintuple propagates a carry beyond the next more significant digit, so only a few logic gate delays are required. Further, due to the simplicity of generating the double, the quadruple can be generated via the instantiation of a second doubler. The delay of the addition to develop the remaining multiples is reduced by using decimal CSAs, at the expense of producing the multiples in a redundant form. Dealing with this redundancy is addressed in the next section on partial product generation.
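The coverage property behind the reduced set {1CA, 2CA, 4CA, 5CA} is that every multiplier digit is the sum of at most two members of {1, 2, 4, 5} (zero selects none). A small exhaustive check, with an illustrative decomposition table of our own choosing (the paper's hardware selects the pair via control signals):

```python
# Every multiplier digit d maps to at most two multiples from the
# reduced set {1, 2, 4, 5} whose sum is d (0 selects no multiple).
DECOMP = {0: (), 1: (1,), 2: (2,), 3: (1, 2), 4: (4,),
          5: (5,), 6: (2, 4), 7: (2, 5), 8: (4, 4), 9: (4, 5)}

CA = 7345  # an arbitrary multiplicand significand
for digit, parts in DECOMP.items():
    assert len(parts) <= 2                          # at most two additions
    assert sum(parts) == digit                      # decomposition is exact
    assert sum(m * CA for m in parts) == digit * CA # partial product correct
```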

Once all the secondary multiples are available, the multiplier operand's digits are examined from the LSD to the MSD to select the appropriate multiples. Control signals select multiples from the set {1CA, 2CA, 4CA, 5CA} and steer them into the decimal CSAs (see the "Decimal (3:2) Counter" block in Fig. 2). The output of the CSAs is the decimal carry-save partial product pair PPC and PPS, to be added to the temporary intermediate products TIPC_i and TIPS_i. The temporary intermediate product is so named as the selected partial products are accumulated, one at a time, until all the digits of the multiplier operand have been examined.

For the same reason CSAs are preferred over carry-propagate adders (CPAs), decimal (4:2) compressors are chosen for the accumulation of the partial products. A decimal 4:2 structure is used because the temporary intermediate product is a redundant value and the incoming partial product to be added is also a redundant value. The compressors are located in the "Decimal (4:2) Compressor" block in Fig. 2. Two observations are made regarding the shifting of the intermediate product with each iteration. First, the intermediate product IP is 2p digits long after all the iterations have completed. Second, the less significant half of IP is in a nonredundant form, since each iteration produces one product digit.

3.2 Intermediate Exponent Generation

At the end of partial product accumulation, p digits of the intermediate product are to the right of the decimal point, in the FRP. Thus, the intermediate exponent of the intermediate product IE_IP is the preferred exponent increased by p:

IE_IP = E_A + E_B − bias + p = PE + p.  (3)

After left shifting the intermediate product as part of preparing a p-digit final product, the intermediate exponent is decreased by the shift-left amount SLA (described in the next section). The exponent associated with this shifted intermediate product IE_SIP is calculated as follows:

IE_SIP = IE_IP − SLA.  (4)

The shifted intermediate product is then rounded, and the associated exponent is named the intermediate exponent of the rounded intermediate product IE_RIP. IE_RIP is one less than IE_SIP or equal to IE_SIP, depending on whether or not a corrective left shift of one digit occurs during rounding. However, the product exponent may differ from IE_RIP due to an exception.
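The exponent bookkeeping of (2)-(4), together with the possible corrective shift during rounding, chains as follows (a behavioral sketch with variable names mirroring the text; `exponent_chain` is our own name):

```python
def exponent_chain(EA, EB, bias, p, SLA, corrective_shift):
    """Track IE_IP -> IE_SIP -> IE_RIP per (3), (4), and the rounding step."""
    PE = EA + EB - bias            # preferred exponent, per (2)
    IE_IP = PE + p                 # per (3)
    IE_SIP = IE_IP - SLA           # per (4)
    IE_RIP = IE_SIP - (1 if corrective_shift else 0)
    return IE_IP, IE_SIP, IE_RIP

bias, p = 398, 16                  # decimal64 parameters
IE_IP, IE_SIP, IE_RIP = exponent_chain(EA=396, EB=396, bias=bias, p=p,
                                       SLA=5, corrective_shift=True)
assert IE_IP == 396 + 396 - 398 + 16
assert IE_SIP == IE_IP - 5
assert IE_RIP == IE_SIP - 1
```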

3.3 Intermediate Product Shifting

The intermediate product may need to be shifted to achieve the preferred exponent or to bring the product exponent into range. Calculating the shift amount is dependent upon, among other things, the number of leading zeros in the intermediate product. However, instead of waiting until the intermediate product is generated to count the leading zeros, the latency of the multiplication is reduced by determining a shift amount based on the leading zeros in both the multiplicand and multiplier significands. With this approach, the precalculated shift amount may be off by one, since the number of significant digits in the final product may be one less than the sum of the significant digits of each significand. Thus, the product may need to be left shifted by one additional digit at some point after the initial shift.

Except when IE_IP < E_min, the shift is always to the left.³

This is because each partial product is added to the previous accumulated product with its LSD one digit to the left of the decimal point. With an estimate of the significance of the intermediate product based on the significance of each significand, S_IP = S_A + S_B, the shift-left amount is determined as follows: If S_IP > p, then one or more leading zeros of the intermediate product may need to be shifted off to the left to maximize the significance of the result. However, if S_IP ≤ p, then the entire product resides solely in the lower half of the intermediate product register (assuming that all the multiplier significand digits have been processed). In the latter case, the less significant half of the intermediate product register can be placed into the upper half by left shifting the intermediate product by p digits. These two situations lead to the following equation for the shift-left amount SLA:

SLA = min((2 · p) − (S_A + S_B), p)
    = min((2 · p) − ((p − LZ_A) + (p − LZ_B)), p)
    = min(LZ_A + LZ_B, p),  (5)

where LZ_A and LZ_B are the leading zero counts of the significands C_A and C_B, respectively.
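Equation (5) collapses to a function of the two leading-zero counts alone; a direct sketch (decimal64's p = 16 is used in the checks, and `shift_left_amount` is an illustrative name):

```python
def shift_left_amount(LZA, LZB, p):
    """SLA = min(LZA + LZB, p), per (5)."""
    return min(LZA + LZB, p)

p = 16  # decimal64 precision
# Significands with 3 and 2 leading zeros: estimated significance
# S_IP = (p - 3) + (p - 2) = 27 > p, so shift off LZA + LZB = 5 zeros.
assert shift_left_amount(3, 2, p) == 5
# Two short significands: the whole product sits in the lower half of the
# intermediate product register, so the shift saturates at p digits.
assert shift_left_amount(10, 12, p) == p
```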

In the event that the actual significance of the intermediate product is one digit less than the estimated significance, it may be necessary to left shift the intermediate product one more digit after the initial left shift. For this reason, a guard digit is maintained and used in a manner analogous to that of BFP multiplication. The handling of the final left shift by one digit occurs in the rounding portion of the algorithm and is described in Section 3.6.

3.4 Sticky Determination

After left shifting, any nonzero digits in the less significant half of the intermediate product register must be evaluated in the context of the rounding mode to determine if rounding up is necessary. In the event that the corrective left shift is performed and the guard digit is shifted into the LSD position of the more significant half of the intermediate product register, the next digit must be maintained


3. This assumes that every multiplier digit is processed during the partial product accumulation portion of the multiplication algorithm.



to determine if the remaining digits are less than one half of the Unit in the Last Place (ULP), exactly one-half ULP, or greater than one-half ULP. Since this digit, in the next less significant position to the guard digit, is critical to rounding, it is called the round digit, which is analogous to the round bit in binary multiplication. The bits of the digits to the right of the round digit are all logically ORed to produce a sticky bit.

The sticky bit is generated on the fly, as in each iteration a nonredundant digit enters the MSD of the less significant half of the intermediate product register. To determine when a digit being right shifted from the round digit position to the next less significant digit position should be included in the sticky bit generation, a counter is used. The starting value of this counter is initialized just prior to the first partial product entering the intermediate product register and is cleared between multiplications. Whenever the counter value is greater than zero, the digit being shifted out of the round digit position is ORed with the previous sticky bit value, which is cleared between operations. The sticky logic is shown in the inset in Fig. 2. The initial value of the sticky counter SC is the significance of the intermediate product minus the format's precision, unless this difference yields a negative number. Thus

SC = max(0, SIP − p)
   = max(0, (p − LZA) + (p − LZB) − p)
   = max(0, p − (LZA + LZB)).        (6)
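A software model of this counter-based scheme might look as follows. This is a sketch under one plausible reading of the text, in which the counter decrements as each digit passes the round position; the exact hardware behavior is in Fig. 2, and the names are ours:

```python
def sticky_counter_init(lza, lzb, p):
    """Equation (6): SC = max(0, p - (LZA + LZB))."""
    return max(0, p - (lza + lzb))

def on_the_fly_sticky(shifted_out_digits, sc):
    """OR the digits shifted past the round position into the sticky bit
    while the counter is nonzero (the decrementing counter is an assumption)."""
    sticky = 0
    for d in shifted_out_digits:
        if sc > 0:
            sticky |= int(d != 0)
            sc -= 1
    return sticky
```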

3.5 Exception Handling

There are four exceptions that may be signaled during multiplication: invalid operation, inexact, overflow, and underflow. The invalid operation exception is signaled when either operand is a signaling NaN or when zero and infinity are multiplied. The default handling of the invalid operation exception involves signaling the exception and producing a quiet NaN for the result. If only one operand is a signaling NaN, then the quiet NaN result is created from the signaling NaN.

The inexact exception is signaled when the rounded result of an operation differs from what would have been computed were both exponent range and precision unbounded. In this situation, the rounded or overflowed result is returned.

The overflow exception is signaled when a result's magnitude exceeds the largest finite number. The detection is accomplished after rounding by examining the computed result as though the exponent range is unlimited. Default overflow handling, as specified in IEEE 754-2008, involves the selection of either the largest normal number or canonical infinity and signaling the inexact exception.

The underflow exception is signaled when a result is both tiny and inexact. Tininess occurs when the result's magnitude is between zero and the smallest normal number, exclusive. The following examples illustrate underflow using numbers in the decimal32 format, which has 1,000,000 × 10^(Emin−bias) as its smallest normal number. When 7,777,000 × 10^(Emin−bias) is multiplied by 9,000,000 × 10^(bias−8), the intermediate product is 6,999,300.0000000 × 10^(Emin−bias−1). However, this exponent cannot be represented. Instead of abruptly converting this number to zero, a subnormal number is produced by shifting the significand to the right by one digit position and increasing the exponent by one to achieve the minimum exponent. Thus, the product significand is 0,699,930 × 10^(Emin−bias). By reducing the precision in this manner, underflow occurs gradually. In this example, the shifting to the right of the significand did not result in the loss of any nonzero digits. Thus, the result is exact, albeit subnormal, and the underflow exception is not raised. Alternatively, consider when 7,777,000 × 10^(Emin−bias) is multiplied by 9,000,000 × 10^(bias−13) to yield an intermediate product of 6,999,300.0000000 × 10^(Emin−bias−6). After right shifting to achieve the minimum exponent and rounding,4 the result is 0,000,007 × 10^(Emin−bias). Since one or more nonzero digits are "lost" to rounding, the result is both tiny and inexact, and the underflow exception is signaled.
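The two decimal32 examples can be checked numerically. The sketch below (a hypothetical helper, not the paper's hardware) right shifts the 7-digit intermediate significand by the exponent shortfall and applies roundTiesToEven to the dropped digits:

```python
def gradual_underflow(significand, shortfall):
    """Right shift `significand` by `shortfall` digit positions to reach
    Emin, rounding the dropped digits with roundTiesToEven.
    Returns (subnormal_significand, inexact)."""
    scale = 10 ** shortfall
    kept, dropped = divmod(significand, scale)
    inexact = dropped != 0
    half = scale // 2
    if dropped > half or (dropped == half and kept % 2 == 1):
        kept += 1
    return kept, inexact

# First example: one position below Emin, no nonzero digits lost -> exact.
print(gradual_underflow(6999300, 1))   # (699930, False)
# Second example: six positions below Emin -> rounds up to 7, inexact.
print(gradual_underflow(6999300, 6))   # (7, True)
```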

3.6 Rounding

Rounding is required when all the essential digits of the intermediate product cannot be placed into the product significand or when overflow or underflow occurs. The description of each rounding mode required by IEEE 754-2008, along with its associated condition(s), is listed in Table 2, where TP+0 represents the more significant half of the intermediate product IP. The default rounding mode is language defined but is encouraged to be roundTiesToEven.

ERLE ET AL.: DECIMAL FLOATING-POINT MULTIPLICATION 907

4. The roundTiesToEven rounding mode is used.

TABLE 2
Rounding Mode, Round-Up Conditions, and Product Override for Overflow

In the case of rounding based solely on the number of essential digits, rounding is accomplished by selecting either the shifted intermediate product truncated to p digits or its incremented value. In order to determine which value is to be selected, the following are needed: the rounding mode, the product's sign, and the shifted intermediate product (including a guard digit g, round digit r, and sticky bit sb).

An adder is used to produce the truncated intermediate product TP+0 and its incremented value TP+1 in nonredundant form. The adder must be able to add a one into either its LSD position or its guard position. The two options for the position to add a one are necessary as the estimate of the shift-left amount may be off by one, in which case a corrective left shift is required. Though this may appear to require more than one adder, both situations can be supported by using a single compound adder. The inputs to the adder are the data in the p MSD positions of the shifted intermediate product SIP[0 : p−1]. To understand why it is sufficient to use only one p-digit compound adder, consider the four possibilities of adding a zero or a one into the LSD or guard digit positions. Clearly, adding a zero into the guard digit position is the same as adding a zero into the LSD position (so long as the original guard digit is concatenated). The remaining two possibilities are related in the following way. If a one is added into the guard digit position and a carry occurs out of the guard digit position (i.e., g == 9), then this is equivalent to adding a one into the LSD position and changing g to zero. Conversely, if a carry does not occur out of the guard digit position (i.e., g < 9), then this is equivalent to adding a zero into the LSD position and concatenating the incremented guard digit.
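This equivalence can be demonstrated with a small model (the names are illustrative): the two compound adder outputs TP+0 = tp and TP+1 = tp + 1 cover both increment positions.

```python
def add_ulp_at_guard(tp, g):
    """Add one at the guard position using only the two outputs of a single
    p-digit compound adder, TP+0 = tp and TP+1 = tp + 1."""
    if g == 9:                # carry out of the guard position
        return tp + 1, 0      # equivalent to +1 at the LSD with g -> 0
    return tp, g + 1          # only the concatenated guard digit changes

# Cross-check against direct arithmetic on the (p+1)-digit value tp*10 + g.
for tp in (123456, 999998):
    for g in range(10):
        hi, lo = add_ulp_at_guard(tp, g)
        assert hi * 10 + lo == tp * 10 + g + 1
```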

Before presenting the rounding scheme employing the compound adder, the following simplification is offered. This design does not need to contend with rounding overflow. Rounding overflow is when the truncated intermediate product TP is incremented due to rounding and a carry out of the MSD position occurs. A proof that rounding overflow cannot occur is provided in the Appendix.

The use of a single compound adder to generate both a p-digit significand and its incremented value, along with the guarantee of no postrounding normalization, allows a simple and efficient rounding scheme to be developed that is distinct from recent binary rounding schemes such as those presented and referenced in [21].

First, an indicator grsb is set whenever there are nonzero digits to the right of the LSD position of the shifted intermediate product. That is, grsb = (g > 0) ∨ (r > 0) ∨ sb. If there is a leading zero in the compound adder's outputs, this indicator is used in determining if a corrective left shift is needed. The corrective left shift does not happen when round up is to occur and the first p digits of the shifted intermediate product are a zero followed by p−1 nines. In this case, a round up produces a carry into the MSD position, and the corrective left shift must be preempted. Fortunately, this unique case can be readily detected. It is the only situation in which the MSD of TP+0 is zero and the MSD of TP+1 is nonzero.

Next, for the given rounding mode, two round-up values ru_{cls==0} and ru_{cls==1} are computed for the cases of no corrective left shift by one and a corrective left shift by one, respectively. The computations are based on the shifted intermediate product and the round-up condition(s) in Table 2. The difference between the two round-up values is that for the case of a corrective left shift, the guard digit is treated as the LSD, and the round digit is treated as the guard digit. As an example, if the rounding mode is roundTowardZero, both round-up values are zero. As another example, if the rounding mode is roundTiesToAway and the LSD, guard, round, and sticky values are 0, 5, 0, and 0, then ru_{cls==0} = 1 and ru_{cls==1} = 0.
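Only two of the round-up conditions are spelled out in the text (Table 2 lists the full set), so a sketch can cover just those. The function below is our illustration, not the paper's logic:

```python
def round_up(mode, guard, rnd, sticky):
    """Round-up decision for the two modes whose conditions appear in the
    text; the paper's Table 2 covers all required IEEE 754-2008 modes."""
    if mode == 'roundTowardZero':
        return 0
    if mode == 'roundTiesToAway':
        return int(guard >= 5)        # discarded part is at least half a ULP
    raise ValueError('mode not sketched here: ' + mode)

# The text's example: LSD, guard, round, sticky = 0, 5, 0, 0.
ru_cls0 = round_up('roundTiesToAway', guard=5, rnd=0, sticky=0)
# With a corrective left shift, the round digit acts as the guard digit.
ru_cls1 = round_up('roundTiesToAway', guard=0, rnd=0, sticky=0)
print(ru_cls0, ru_cls1)   # 1 0
```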

Along with this, the guard digit g is used to produce gp1 = (g + 1) mod 10, and an indicator g9 is set when g = 9. The value gp1 is needed during a corrective left shift when round up is performed.

Using TP+0, TP+1, ru_{cls==0}, ru_{cls==1}, grsb, g, gp1, and g9, the algorithm presented in Fig. 3 realizes rounding for the DFP multiplier design. The algorithm is presented as three distinct cases involving the MSDs of the two conditional sums TP+0 and TP+1. The fourth case, when TP+0 has a zero in its MSD and TP+1 has a nonzero digit in its MSD, cannot happen.

Fig. 3. DFP multiply rounding scheme.

Though the rounding scheme in Fig. 3 may appear complex, the choice is simply between TP+0, TP+1, or these values left shifted by one digit with either g or g+1 concatenated. For those cases in which a left-shifted form of a conditional sum is chosen, the intermediate exponent of the shifted intermediate product is decremented. Alternative rounding algorithms such as injection-based rounding [22], [23] were explored, but none offered a definitive advantage.

In the case of rounding due to overflow, the product is rounded according to the rightmost column in Table 2. The table describes the product to be generated for each rounding mode under default exception handling as specified in IEEE 754-2008. Note that overflow and underflow may be avoided in some operations by increasing or decreasing SLA, respectively. Any change to SLA must accordingly change the intermediate product and the sticky counter. The next section describes how the components in the previous sections are combined to realize a DFP multiplier.

3.7 Implementation

Fig. 4 shows the design components from several of the previous sections together to complete iterative DFP multiplication after all the partial products have been accumulated. The block-level drawing is of the bottom data path portion of the p-digit iterative DFP multiplier design, beginning with the 2p-digit intermediate product register and a sticky bit that was generated on the fly (see Section 3.4). The top data path portion of the design, ending with the same intermediate product register, is shown in Fig. 2. Not shown in either design drawing is the control logic, which is where the intermediate exponents, the sticky counter, the shift-left amount, and the rounding control are calculated. These calculations are not timing critical in an iterative multiplier.

Fig. 4. Bottom portion of the iterative DFP multiplier design.

Referring again to Fig. 4, the first step is to shift the 2p-digit intermediate product based on the shift-left amount SLA (described in Section 3.3) and store the (p+2)-digit output in the shifted intermediate product register. The two additional digits are needed for the guard and round digits. Then, in support of the rounding scheme presented in Section 3.6, a compound adder receives the data stored in the shifted intermediate product register. Since the data are either in nonredundant form or in sum-and-carry form (four sum bits and one carry bit per digit position), a unique decimal compound adder is needed. For the sum portion of the addition, (sum_i + carry_i + 0) mod 10 and (sum_i + carry_i + 1) mod 10 are generated for each digit position i. For the carry portion of the addition, a digit generate equal to s_i[0] ∧ s_i[3] ∧ c_i and a digit propagate equal to s_i[0] ∧ (s_i[3] ∨ c_i) are produced. The digit propagate and generate are then fed into a carry network that produces the carries to select the appropriate digits to yield TP+0 and TP+1. The compound adder is only p digits long, as described in Section 3.6. Once the two compound adder outputs TP+0 and TP+1 are available in registers, the rounding logic produces the rounded intermediate product RIP based on the rounding scheme in Fig. 3. Completion of the multiply operation involves selecting between RIP and a special value such as NaN or infinity and producing the final result in DPD-encoded form.
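The digit propagate/generate terms can be sanity checked in software. Under our reading of the text, with s_i the BCD-8421 bits of the sum digit (s_i[0] the weight-8 bit, s_i[3] the weight-1 bit) and c_i the carry bit, generate means sum_i + c_i ≥ 10 and propagate means sum_i + c_i ≥ 9:

```python
def digit_pg(sum_digit, carry_bit):
    """Propagate/generate for one digit of the decimal compound adder."""
    s0 = (sum_digit >> 3) & 1           # s[0], weight 8
    s3 = sum_digit & 1                  # s[3], weight 1
    gen = s0 & s3 & carry_bit           # digit value 9 plus a carry -> 10
    prop = s0 & (s3 | carry_bit)        # digit value totals at least 9
    return prop, gen

# Check against plain arithmetic for all valid BCD digits.
for d in range(10):
    for c in (0, 1):
        prop, gen = digit_pg(d, c)
        assert gen == int(d + c >= 10)
        assert prop == int(d + c >= 9)
```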

Notable implementation choices include leveraging the leading zero counts of the operands' significands, passing NaNs through the data flow with minimal overhead, and handling gradual underflow via minor modification to the control logic. Equations (5), (6) and, indirectly, (4) use the leading zero counts of the significands. This is intentional, as the determination of leading zeros is a common function in floating-point units [24]. Once each digit is identified as zero or nonzero, the generation of the leading zero count is the same as that for a BFP significand. As the accumulation of partial products is iterative, a single leading zero counter is used to determine successively the leading zero counts of CA and CB. If an operand is NaN, that operand's NaN payload is used when forming the result. Instead of supporting alternative paths through the data flow, the control logic passes CA through the data flow by multiplying it by 1. If operand B is NaN, CB is held in the less significant portion of the intermediate product register while the control logic overrides the shift-left amount such that CB is left shifted into the more significant half of the shifted intermediate product register. As for gradual underflow, the control logic extends the iterative partial product accumulation portion of the algorithm and successively adds partial products equal to zero such that the accumulated partial product is right shifted until IEIP increases to Emin or all nonzero digits are shifted into the sticky bit. This support of gradual underflow extends the latency of the multiplier from p+9 = 25 cycles to a maximum latency of 25 + (p+2) = 43 cycles when IEIP ≤ Emin − bias − (p+2).

4 PARALLEL DFP MULTIPLIER

A flowchart-style drawing of the parallel DFP multiplication algorithm is shown in Fig. 5, with the steps of the decimal fixed-point multiplier surrounded by a gray box. The fixed-point multiplier in this implementation is fundamentally the same as the radix-10 fixed-point decimal multiplier in [12], except that the final adder uses a Kogge-Stone carry network. As with the iterative multiplier, the operation begins with the reading and decoding of the operands. However, after the initial decoding is performed, the multiplicand is recoded from BCD-8421 into BCD-4221, while the multiplier is recoded into {−5, ..., +5}. For this design, the recoding of the p-digit multiplier operand into the radix-10 digit set {−5, ..., +5} is chosen as this approach yields only p+1 partial products, as opposed to 2p partial products with the alternative multiplier recoding schemes presented in [12]. Even though the chosen multiplier recoding scheme necessitates the use of a p-digit decimal CPA to produce the multiplicand triple, this addition can be accomplished with less hardware and in a similar delay as it takes to reduce 2p partial products to roughly (p+1) partial products.

Once the multiplicand is in BCD-4221 form, the double, triple, quadruple, and quintuple of the multiplicand are generated in the data path portion of the design. From this multiple set {1CA, 2CA, 3CA, 4CA, 5CA}, all the partial products are formed. The p+1 partial products are selected based on each of the digits in the recoded multiplier operand and are then presented in parallel to the partial product accumulation tree. After all the partial products are accumulated into carry and sum vectors, a CPA is used to produce a nonredundant intermediate result equal in length to the sum of the operands' digits.

In parallel with the partial product accumulation, the shift-left amount SLA and the intermediate exponent of the intermediate product IEIP are calculated for shifting the intermediate product IP to fit into p digits of precision. These calculations are performed in the same manner as that described for the iterative multiplier, namely, using the operands' leading zero counts in order to estimate the number of significant digits in the result. In addition to the SLA and IEIP values, the sign bit of the final result is calculated, and exception conditions are detected.


Fig. 5. Flowchart of the parallel DFP multiplier design [14].


The IP emerging from the partial product accumulation tree is then left shifted by the SLA amount, forming the shifted intermediate product SIP. In the iterative design, when the intermediate exponent of the intermediate product is less than Emin, the control logic allows the iterative portion of the algorithm to continue, right shifting IP in an effort to bring the IEIP into range. Thus, gradual underflow is supported in hardware with no change to the data-flow circuitry. In the parallel design, however, supporting gradual underflow requires a modification to the data flow in the form of expanding the function of the left shifter to both a left and right shifter.

Next, the sticky bit is produced from the fractional product FRP, which resides in the less significant half of the 2p-digit shifted intermediate product register. In parallel, the truncated product TP is incremented to allow the rounding logic to select between TP+0 and TP+1, which, as described previously, is sufficient to support all rounding modes (see Section 3.6).

Finally, the rounding and exception logic, based on the rounding mode and exception conditions, selects between TP+0, TP+1, and special-case values to produce the rounded intermediate product RIP. RIP is then encoded in DPD and put in IEEE 754-2008 format with the appropriate flags set to produce the final product FP. In the following sections, those components and functions distinct from the iterative multiplier design are described in more detail. Fig. 7 depicts the parallel DFP multiplier design.

4.1 Partial Product Generation, Selection, and Accumulation

By recoding the multiplier operand into the range {−5, ..., +5}, only four multiplicand multiples need be generated, {2CA, 3CA, 4CA, 5CA}, where CA refers to the multiplicand. Along with the complement of these multiples and the original multiplicand, any partial product in the set {−5CA, ..., +5CA} can be developed [12]. As explained in Section 3.1, the double, quadruple, and quintuple can be readily generated without carry propagation, while the triple requires a carry-propagate addition. This function is shown in the "Multiplicand Multiple Generation" block in Fig. 7.

The multiplier operand is effectively recoded to produce control signals directly from the BCD-8421 digits. These signals select a multiple from the previously generated multiple set {1CA, ..., 5CA} and selectively complement the chosen multiple. Since the BCD-4221 code is a self-complementing code, the complement of the multiple is achieved via a bank of XOR gates controlled by a single complement signal. This scheme, from [12], enables the parallel selection of p+1 multiples to become the partial products (see the "Multiple Selection" block in Fig. 7). The reason p+1 partial products are needed, as opposed to one partial product for each of the p multiplier operand digits, is because the "recoding" of each digit greater than five results in the next more significant digit being incremented by one and the current digit being complemented. For example, 6 = 10 − 4 = 10 + (−4). If the multiplier's MSD is greater than five, the next more significant digit, an implied zero, is incremented to one, resulting in an additional partial product.
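A software sketch of this signed-digit recoding follows. The hardware derives control signals directly from the BCD digits rather than materializing the recoded digits; this model (with names of our choosing) just shows why p+1 partial products arise:

```python
def recode_multiplier(digits):
    """Recode a BCD digit list (MSD first) into the set {-5, ..., +5}.
    Digits above five borrow ten and bump the next more significant digit,
    so an MSD above five yields one extra leading partial product."""
    recoded, carry = [], 0
    for d in reversed(digits):          # walk from the LSD upward
        d += carry
        carry = int(d > 5)
        recoded.append(d - 10 * carry)
    if carry:
        recoded.append(1)               # the implied leading zero becomes 1
    return recoded[::-1]

# 6 = 10 - 4, the text's example: one extra partial product appears.
print(recode_multiplier([6]))    # [1, -4]
```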

With the necessary multiplicand multiples available, the control signals derived from the multiplier operand's digits select the p+1 appropriate multiples to serve as the partial products. Each partial product is prefixed with a digit to realize sign extension of the negative multiples [12]. Further, if the next less significant partial product is a negative multiple, the current partial product is suffixed with a digit of one to complete the 10's complement. The partial products, after being aligned to match the order of their respective multiplier digits, enter an accumulation tree to be reduced to a 2p-digit intermediate product in carry-save format. These functions are shown in the "5:1 Multiplexor" and "Complement/XOR Bank" blocks in Fig. 7.

The reduction scheme makes use of two additional advantages of the BCD-4221 coding. First, BCD-4221 allows the use of binary addition within each decimal digit position, since all 16 combinations of this four-bit code are valid decimal numbers. Second, BCD-4221 is highly suitable for decimal doubling, as its three MSBs can represent the values {0, 2, 4, 6, 8}. This set of values is required as the outputs produced when decimal doubling a digit are {00, 02, 04, 06, 08, 10, 12, 14, 16, 18}, where the first digit of each element in the set is either zero or one and corresponds to the value of the carry-out bit. Thus, the second digit of each element (i.e., the three MSBs of the output digit) can be produced directly from the input digit without concern of perturbation by a carry-in. Should a carry bit emerge from the next less significant digit being doubled, it can simply be placed in the LSB position. Fast decimal doubling is important as the intermediate carry bits emerging from the binary addition of two BCD-4221 digits need to be decimal doubled to yield a correct decimal value.
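The even-value/carry split that makes this doubling ripple-free can be illustrated numerically (bit-level BCD-4221 encodings are omitted; the value-level model below is ours):

```python
def double_bcd_number(digits):
    """Decimal doubling of a digit list (MSD first). Doubling any digit
    yields an even result digit {0,2,4,6,8}, which BCD-4221 holds in its
    three MSBs, so the carry from the digit below drops into the free LSB
    without any carry propagation."""
    out, carry = [], 0
    for d in reversed(digits):           # LSD first
        c_out, even = divmod(2 * d, 10)  # carry-out and even result digit
        out.append(even + carry)         # carry-in occupies the free LSB slot
        carry = c_out
    if carry:
        out.append(carry)
    return out[::-1]

print(double_bcd_number([9, 9]))   # [1, 9, 8]
```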

Due to the aligning of the partial products based on their respective weights, the number of decimal digits to be accumulated varies from 2 to p+1. For a p = 16 design, a binary CSA tree layout for the worst case of 17 decimal digits is shown on the left in Fig. 6. The tree is based on the structure presented in [12] but modified to add an additional partial product to support multiplier operands with MSDs > 5. The worst case path through the accumulation tree traverses six 4-bit 3:2 binary CSAs and three decimal digit doublers. A decimal partial product array is shown on the right in Fig. 6 to illustrate how the partial products are aligned and sign extended. The circuitry shown in Fig. 6 is contained in the "Partial Product Accumulation/CSA Tree" block in Fig. 7.

The intermediate product emerging from the partial product accumulation tree is in sum-and-carry form, where each sum digit and carry digit is four bits. These digits are first converted to BCD-8421 before being added using a high-speed direct decimal CPA [18] to produce a 2p-digit BCD-8421 result. The direct decimal adder uses a Kogge-Stone network to quickly produce the carries between digits. This adder configuration is different from that presented in [12], as details of that adder are limited and the aforementioned adder was available to the authors and provides an efficient implementation. In Fig. 6, a final decimal digit doubler is omitted as its latency is hidden in the sum generation portion of the decimal CPA.

4.2 Intermediate Product Shifting, Rounding, and Sticky Generation

Once the intermediate product IP emerges from the CPA in nonredundant form, it enters a 2p-digit left shifter to produce the shifted intermediate product SIP. The shifter is controlled by the shift-left amount SLA (see (5)). The more significant half of the SIP, called the truncated product TP, is incremented, while the less significant half of the SIP, called the fractional product FRP, is used to produce the sticky bit. Using the rounding scheme presented in Section 3.6, the rounding logic produces the rounded intermediate product RIP by selecting between the nonincremented truncated product TP+0 and its incremented value TP+1 and possibly concatenating the guard digit from the FRP or its incremented value. Completion of the multiply operation involves the same steps as in the iterative DFP multiplier, namely, selecting between RIP and a special value such as NaN or infinity and producing the final result in DPD-encoded form.

Performing the shift operation before the CPA in the fixed-point multiplier was examined but was not implemented due to the following trade-offs. The primary benefits of using a smaller p-digit CPA after the shifter and the potential to combine the final addition and rounding are outweighed by several negative factors. First, the calculation of the sticky bit from a redundant carry-save representation requires roughly twice as many gates to perform the calculation, increasing the area of the rounding logic. Second, a separate carry tree is required to calculate any possible carry out of the redundant representation of the FRP, along with more complicated rounding logic to handle this possible carry into the round digit during result selection. It also requires the generation of TP+2, in addition to TP+1 and TP+0, to handle the case of both a round up and a carry into the round digit. Third, shifting prior to adding requires two 2p-digit shifters, instead of one, as the intermediate product has sum and carry vectors comprised of 4-bit digits. Since the fixed-point multiplication dominates the latency of the DFP multiplication, we believed that the small delay benefit of performing the shift earlier is outweighed by the additional area overhead of the 2p-digit shifter and larger sticky calculation logic. For these reasons, the shifter was placed after the CPA, as shown in Fig. 7.

In contrast with the iterative multiplier design that developed the sticky bit on the fly, the sticky bit in the parallel DFP multiplier is calculated from the least significant p−2 digits of the fractional product FRP. The calculation is simply the OR of all these bits. A method to precalculate the sticky bit in parallel with the fixed-point multiplier was considered. However, this method did not reduce the worst case delay, and it increased the area.

Fig. 6. Reduction tree for 17 partial products using 4-bit binary CSAs and decimal digit doublers.
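The parallel design's sticky computation reduces to a single OR across the low digits of FRP; a minimal sketch (the function name and digit-list interface are ours):

```python
def parallel_sticky(frp_digits):
    """Sticky bit for the parallel design: an OR across the bits of the
    least significant p-2 digits of FRP. The two leading digits of FRP
    are the guard and round digits, which are returned alongside."""
    g, r, *rest = frp_digits
    sb = int(any(d != 0 for d in rest))
    return g, r, sb
```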

4.3 Exception Handling

Although virtually all the exception detection and handling is the same in the parallel design as it is in the iterative design, the parallel design does not support gradual underflow. Rather, the detection of IEIP < Emin simply leads to the raising of an output flag to inform the system that some other mechanism is needed to calculate the subnormal result. The motivation behind this decision is the considerable savings in area and delay obtained by removing this feature from the hardware implementation.

In the iterative multiplier, it is straightforward to realize the necessary right shifting of the intermediate product. However, in the parallel DFP multiplier, right shifting the intermediate product requires the replacement of the left shifter with a right-left shifter. If this design choice is made, a shift-right amount SRA is needed to control the shifter and to increase the intermediate exponent of the shifted intermediate product IESIP. The equations for SRA and IESIP follow, where IEIP < Emin:

SRA = min(Emin − IEIP, p + 2),
IESIP = IEIP + SRA.
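These two equations restate as a one-line model (a sketch; p = 16 for the implemented design):

```python
def shift_right_control(ie_ip, emin, p=16):
    """SRA = min(Emin - IEIP, p + 2) and IESIP = IEIP + SRA,
    applicable only when IEIP < Emin."""
    assert ie_ip < emin
    sra = min(emin - ie_ip, p + 2)
    return sra, ie_ip + sra
```

The clamp at p + 2 reflects that once every nonzero digit has passed through the guard, round, and sticky positions, further right shifting changes nothing.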


Fig. 7. Parallel DFP multiplier design.


4.4 Implementation

The data-flow portion of the 16-digit parallel DFP multiplier design is depicted in Fig. 7. The control portion of this design is very similar to that described throughout Section 3, with the exception that the sticky counter SC is not needed. Note that as the register-transfer-level (RTL) model for this design was written without storage elements to take advantage of an autopipelining feature of Synopsys Design Compiler, the actual placement of storage elements throughout the data flow may differ slightly from that shown in the figure. More information on the RTL model and autopipelining appears in Section 5.

The first step in Fig. 7 is to generate in parallel the multiplicand multiples {1CA, ..., 5CA} in BCD-4221 code, the multiple selects {0, ..., 5}, and the multiple complement signals (see Section 4.1). The 17 partial products, the longest of which can be up to 19 digits due to a carry-out digit when generating the multiplicand multiple, a sign extension digit, and a 10's complement correction digit, enter the binary CSA tree. The intermediate product IP emerging from the reduction tree is in carry-save form. To remove the redundancy, IP is sent through a 32-digit CPA. As the less significant digit positions of the IP have fewer partial product digits to reduce, the 32-digit CPA's overall latency approaches the latency of a 16-digit CPA. The CPA's output is input to a 32-digit left shifter controlled by the shift-left amount SLA. The left shifter produces the shifted intermediate product SIP based on SLA values ranging from 0 to 16 (see Section 3.3).

The more significant half of the SIP is called the truncated product TP. As a selection is to be made between TP, or TP+0, and its incremented value TP+1, TP is sent through a 16-digit decimal incrementer. In parallel with this incrementer, the less significant half of the SIP, called the fractional product FRP, is used to produce the sticky bit sb. This is achieved by sending the 14 LSDs of FRP through a 56-bit OR tree (the leading digits of FRP are the guard and round digits g and r, respectively). Using TP+0, TP+1, g, r, and sb, the rounding logic described in Section 3.6 produces the 16-digit rounded intermediate product RIP. The remainder of the operation is as described in Section 3.7 for the iterative DFP multiplier. This parallel DFP design is compared with the iterative DFP design in the next section.

5 RESULTS AND ANALYSIS

RTL models of both the 64-bit (16-digit) iterative and parallel DFP multipliers and their predecessor decimal fixed-point multipliers [9], [12] were coded in Verilog. To make the comparison between the iterative and parallel designs as balanced as possible, the iterative design was converted to not support gradual underflow. This alteration has no effect on the critical path, and there is virtually no change in area. The designs were synthesized using LSI Logic's gflxp 0.11-μm CMOS standard cell library and Synopsys Design Compiler Y-2006.06-SP1. To validate the correctness of the design, more than 50,000 test cases covering all rounding modes and exceptions were simulated successfully on the designs, both presynthesis and postsynthesis. Publicly available tests used for validation include directed pseudorandom test cases from IBM's FPgen tool [25] and directed tests available in [26].

Table 3 contains area and delay estimates for the DFP multiplier designs and their predecessor decimal fixed-point multiplier designs. The values in the FO4 Delay column are based on the delay of an inverter driving four same-size inverters, which is 55 ps in the aforementioned technology. The parallel decimal fixed-point and DFP multipliers have 8 and 12 pipeline stages, respectively, to achieve critical path delays that are comparable to those of the iterative multiplier designs. The critical path in the iterative DFP multiplier is in the stage with the 128-bit barrel shifter, while for the iterative decimal fixed-point multiplier, it is in the decimal 4:2 compressor. As for the parallel DFP multiplier, the critical path is in the 128-bit shifter, while for the fixed-point portion, it is within the partial product reduction tree. The critical paths identified in the parallel multipliers are with respect to the implementations shown in Table 3.
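The FO4 normalization mentioned above is a simple unit conversion. The following sketch uses a hypothetical delay value, since Table 3's entries are not reproduced here; only the 55-ps FO4 figure comes from the text.

```python
# FO4 normalization: divide an absolute critical-path delay by the
# library's fanout-of-4 inverter delay (55 ps in the cited 0.11-um
# technology), yielding a technology-independent delay figure.
FO4_PS = 55.0

def to_fo4(delay_ps):
    return delay_ps / FO4_PS

# Hypothetical example: a 770 ps critical path corresponds to 14 FO4.
assert to_fo4(770) == 14.0
```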

The entry in Table 3 for the parallel DFP multiplier was chosen from the outputs of synthesis runs using Synopsys Design Compiler's autopipelining feature. This feature provides area and delay results for design implementations of various pipeline depths. With this information, one can more readily examine the costs associated with a desired design point. Table 4 shows the results of several synthesis runs using this autopipelining feature. One slight drawback of autopipelining is that there are no latches on the inputs. Thus, the area numbers are slightly optimistic when compared to the iterative DFP multiplier's numbers.

914 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 7, JULY 2009

TABLE 3
Area and Delay for Decimal Multipliers

The iterative DFP multiplier is significantly smaller and may achieve a higher frequency of operation than the parallel DFP multiplier. Thus, in situations where the area available for DFP is an important design constraint, the iterative DFP multiplier may be an attractive implementation. However, the parallel DFP multiplier has less latency for a single multiply operation and is able to produce a new result every cycle. As for power considerations, the fewer overall devices in the iterative multiplier and, more importantly, the fewer storage elements should result in less leakage. This benefit is mitigated by its higher latency and lower throughput.

6 SUMMARY

A justification for decimal arithmetic hardware and a motivation for DFP multiplication are presented. The design components necessary to extend two previously published decimal fixed-point multipliers, one iterative and one parallel, are described. The components include multiplicand multiple generation, multiplier operand recoding, partial product generation, exponent generation, sticky bit generation, shifting of the intermediate product, rounding, and exception detection and handling. The algorithms and block-level drawings of the proposed iterative and parallel DFP multipliers are presented. Novel features of the proposed multipliers include support for DFP numbers, on-the-fly generation of the sticky bit in the iterative multiplier, early estimation of the shift amount, and efficient decimal rounding. Area and delay estimates are presented from the synthesized results of the verified RTL models.

APPENDIX

PROOF OF NO ROUNDING OVERFLOW

The proof that no rounding overflow can occur is a generalization of the fact that two significands, each with significance equal to the precision $p$, yield an intermediate product with significance $2p - 1$ or $2p$. If the significance of the intermediate product SIP is $2p - 1$, then there will be a zero in the MSD position after left shifting the intermediate product (see Section 3.3). In this case, an increment due to rounding will not propagate beyond the MSD position. Alternatively, if the significance of SIP is $2p$, then in order for overflow to occur, a string of nines $p$ digits long must start in the MSD position, and the digits below it must be nonzero so that rounding increments the product. Thus, the minimum value of the intermediate product needed for overflow can be expressed as $IP = C_A \times C_B = 10^{2p} - (10^p - 1)$. However, the maximum intermediate product is $10^{2p} - 10^p - (10^p - 1)$, as demonstrated below. Since the maximum intermediate product is $10^p$ less than the minimum intermediate product needed for overflow, rounding overflow cannot occur. The generalized proof that rounding overflow does not occur is given below.

Proof: No rounding overflow. Given the range of significands $C_A$ and $C_B$ as

$$10^{S_A - 1} \le C_A \le 10^{S_A} - 1,$$
$$10^{S_B - 1} \le C_B \le 10^{S_B} - 1,$$

the range of the intermediate product $C_{IP} = C_A \times C_B$ is

$$10^{S_A - 1} \times 10^{S_B - 1} \le C_{IP} \le (10^{S_A} - 1) \times (10^{S_B} - 1)$$

or, equivalently,

$$10^{S_A + S_B - 2} \le C_{IP} \le 10^{S_A + S_B} - 10^{S_A} - (10^{S_B} - 1).$$

Using the shift-left amount $SLA = 2p - S_A - S_B$, the upper bound for the shifted intermediate product is

$$C_{SIP} \le C_{IP} \times 10^{2p - S_A - S_B}$$

or, equivalently,

$$C_{SIP} \le 10^{2p} - 10^{2p - S_B} - (10^{2p - S_A} - 10^{2p - S_A - S_B}).$$

Substituting $p$ for $S_A$ and $S_B$, as this is when the shifted intermediate product is at its maximum, yields

$$C_{SIP} \le 10^{2p} - 10^p - (10^p - 1).$$

This maximum achievable value of the shifted intermediate product is less than the minimum value required for overflow, $10^{2p} - (10^p - 1)$, thus demonstrating that overflow cannot occur. ∎
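The algebra above can be spot-checked numerically. This short script, not part of the paper, verifies the bound for the IEEE 754-2008 decimal precisions:

```python
# Numeric check of the appendix bound for several precisions p:
# the maximum shifted intermediate product, 10^(2p) - 10^p - (10^p - 1),
# lies exactly 10^p below the minimum value that could overflow under
# rounding, 10^(2p) - (10^p - 1).

for p in (7, 16, 34):  # decimal32/64/128 precisions
    max_sip = (10**p - 1) ** 2  # both significands at their maximum
    assert max_sip == 10**(2*p) - 10**p - (10**p - 1)
    # p nines in the MSDs plus a nonzero fraction to force a round-up:
    min_overflow = 10**(2*p) - (10**p - 1)
    assert min_overflow - max_sip == 10**p  # rounding overflow impossible
```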

ACKNOWLEDGMENTS

This work is sponsored in part by IBM. The authors wish to thank Andrew Krioukov for his contributions to the design of the parallel DFP multiplier. They are grateful to Alvaro Vazquez, Elisardo Antelo, and Paolo Montuschi for their innovative parallel fixed-point multiplier design used in our parallel DFP multiplier. They are also grateful for the feedback they received from the anonymous reviewers.

REFERENCES

[1] M.A. Erle, M.J. Schulte, and J.M. Linebarger, "Potential Speedup Using Decimal Floating-Point Hardware," Proc. 36th Asilomar Conf. Signals, Systems and Computers (ACSSC '02), vol. 2, pp. 1073-1077, Nov. 2002.

[2] L.-K. Wang, C. Tsen, M.J. Schulte, and D. Jhalani, "Benchmarks and Performance Analysis for Decimal Floating-Point Applications," Proc. 25th IEEE Int'l Conf. Computer Design (ICCD '07), pp. 164-170, Oct. 2007.

[3] L. Eisen, J.W. Ward III, H.-W. Tast, N. Mading, J. Leenstra, S.M. Mueller, C. Jacobi, J. Preiss, E.M. Schwarz, and S.R. Carlough, "IBM POWER6 Accelerators: VMX and DFU," IBM J. Research and Development, vol. 51, no. 6, pp. 663-684, Nov. 2007.


TABLE 4
Area and Delay for Parallel DFP Multiplier Designs



[4] IEEE Standard for Floating-Point Arithmetic, IEEE Working Group of the Microprocessor Standards Subcommittee, IEEE, 2008.

[5] ANSI/IEEE Std 854-1987: IEEE Standard for Radix-Independent Floating-Point Arithmetic, Floating-Point Working Group, IEEE, Oct. 1987.

[6] R.H. Larson, "High Speed Multiply Using Four Input Carry Save Adder," IBM Technical Disclosure Bull., pp. 2053-2054, Dec. 1973.

[7] R.L. Hoffman and T.L. Schardt, "Packed Decimal Multiply Algorithm," IBM Technical Disclosure Bull., vol. 18, no. 5, pp. 1562-1563, Oct. 1975.

[8] T. Ohtsuki, Y. Oshima, S. Ishikawa, K. Yabe, and M. Fukuta, Apparatus for Decimal Multiplication, US patent 4,677,583, June 1987.

[9] M.A. Erle and M.J. Schulte, "Decimal Multiplication via Carry-Save Addition," Proc. 14th IEEE Int'l Conf. Application-Specific Systems, Architectures, and Processors (ASAP '03), pp. 348-358, June 2003.

[10] T. Lang and A. Nannarelli, "A Radix-10 Combinational Multiplier," Proc. 40th Asilomar Conf. Signals, Systems, and Computers (ACSSC '06), pp. 313-317, Oct./Nov. 2006.

[11] R.K. Richards, Arithmetic Operations in Digital Computers. D. Van Nostrand, 1955.

[12] A. Vazquez, E. Antelo, and P. Montuschi, "A New Family of High-Performance Parallel Decimal Multipliers," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 195-204, June 2007.

[13] M.A. Erle, M.J. Schulte, and B.J. Hickmann, "Decimal Floating-Point Multiplication via Carry-Save Addition," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 46-55, June 2007.

[14] B.J. Hickmann, A. Krioukov, M.A. Erle, and M.J. Schulte, "A Parallel IEEE P754 Decimal Floating-Point Multiplier," Proc. 25th IEEE Int'l Conf. Computer Design (ICCD '07), pp. 296-303, Oct. 2007.

[15] M.S. Cohen, T.E. Hull, and V.C. Hamacher, "CADAC: A Controlled-Precision Decimal Arithmetic Unit," IEEE Trans. Computers, vol. 32, no. 4, pp. 370-377, Apr. 1983.

[16] G. Bohlender and T. Teufel, "BAP-SC: A Decimal Floating-Point Processor for Optimal Arithmetic," Computer Arithmetic: Scientific Computation and Programming Languages, B.G. Teubner, pp. 31-58, 1987.

[17] M.F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers," Proc. 16th IEEE Symp. Computer Arithmetic (ARITH '03), pp. 104-111, June 2003.

[18] M.S. Schmookler and A.W. Weinberger, "High Speed Decimal Addition," IEEE Trans. Computers, vol. 20, no. 2, pp. 862-867, Aug. 1971.

[19] M.F. Cowlishaw, "Densely Packed Decimal Encoding," IEE Proc.—Computers and Digital Techniques, vol. 149, no. 3, pp. 102-104, May 2002.

[20] M.F. Cowlishaw, E.M. Schwarz, R.M. Smith, and C.F. Webb, "A Decimal Floating-Point Specification," Proc. 15th IEEE Symp. Computer Arithmetic (ARITH '01), pp. 147-154, July 2001.

[21] N.T. Quach, N. Takagi, and M.J. Flynn, "Systematic IEEE Rounding Method for High-Speed Floating-Point Multipliers," IEEE Trans. VLSI Systems, vol. 12, no. 5, pp. 511-521, May 2004.

[22] L.-K. Wang and M. Schulte, "Decimal Floating-Point Adder and Multifunction Unit with Injection-Based Rounding," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 55-65, June 2007.

[23] G. Even and P.-M. Seidel, "A Comparison of Three Rounding Algorithms for IEEE Floating-Point Multiplication," IEEE Trans. Computers, vol. 49, no. 7, pp. 638-650, July 2000.

[24] M.S. Schmookler and K.J. Nowka, "Leading Zero Anticipation and Detection—A Comparison of Methods," Proc. 15th IEEE Symp. Computer Arithmetic (ARITH '01), pp. 7-12, July 2001.

[25] R.M.M. Aharoni, R. Maharik, and A. Ziv, "Solving Constraints on the Intermediate Result of Decimal Floating-Point Operations," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 38-45, June 2007.

[26] General Decimal Arithmetic Testcases, IBM, http://www2.hursley.ibm.com/decimal/dectest.html, 2007.

Mark A. Erle received the BS degree in electrical engineering from Pennsylvania State University, the MS degree in electrical engineering from the University of Vermont, and the PhD degree in computer engineering from Lehigh University. Since his career began with IBM in 1990, he has worked in design for testability, circuit design, synthesis, and logic design on microprocessor programs ranging from personal computers to mainframes. His research interests include computer arithmetic and low-power/high-frequency circuit design. He is a senior member of the IEEE.

Brian J. Hickmann received the BS degree in computer engineering and mathematics and the MS degree in computer engineering, under the direction of Dr. Michael Schulte, from the University of Wisconsin, Madison, in 2006 and 2008, respectively. He is currently with Intel Corporation-Ronler Acres, Hillsboro, Oregon, working on the Larrabee project and pursuing a career in high-speed circuit design with an emphasis on floating-point units. His current research interests include computer arithmetic, high-speed/low-power microarchitecture units, domain-specific accelerators, and computer architecture. He is a member of the IEEE.

Michael J. Schulte received the BS degree in electrical engineering from the University of Wisconsin, Madison, and the MS and PhD degrees in electrical engineering from the University of Texas at Austin. He is currently an associate professor of computer engineering in the Department of Electrical and Computer Engineering, University of Wisconsin, Madison, where he leads the Madison Embedded Systems and Architectures Group. His research interests include high-performance embedded processors, computer architecture, domain-specific systems, and computer arithmetic. He is an associate editor for the Journal of VLSI Signal Processing. He is a senior member of the IEEE and the IEEE Computer Society.


