Decimal Floating-point Multiplication via Carry-Save Addition

Mark Erle
Systems & Technology Group
International Business Machines

Brian Hickmann & Mike Schulte
Electrical & Computer Engineering
University of Wisconsin at Madison

Outline

• Introduction and motivation
• Extensions to fixed-point design
• Implementation highlights
• Verification and synthesis results
• Summary

Introduction

• Preponderance of business data in decimal form
• Inexact mapping between decimal and binary
• Decimal arithmetic used/required in banking, finance, insurance, accounting
• Increasing support in arithmetic community (IEEE P754 in ballot review process)
• Multiplication a key function

Motivation

• What's involved in extending fixed-point multiplication to support floating-point?
• What are the similarities and differences with BFP multiplication?

[Figure: flowchart of the multiplier. Block labels, in extraction order: Read and Decode Operands; Generate Tuples; Select Tuples; Accumulate Partial Products / Develop Sticky; Left Shift (Redundant) Intermediate Product; Add (Redundant) Shifted Intermediate Product; Round; Sticky Counter Control; Iteration Control; Round Control; Select Significand; Shift Amount Control; Encode and Write Result; Exception Control; Finalize Exponent; Tuple Selection Control; Generate Intermediate Exp. Region labels: Exponent/Control Dataflow; iterative portion. Legend: new function to support floating-point; modified function to support floating-point.]

Intermediate Exponent Calculation

• Preferred exponent: $PE = E_A + E_B - \mathit{bias}$
• Based on location of the decimal point (effective shift right): $IE_{IP} = PE + p$
• After left shifting the intermediate product: $IE_{SIP} = IE_{IP} - SLA$

[Figure labels: Multiplicand (p digits); Tuple Generation and Selection; Accumulator; Intermediate Product Register (2p digits), effectively a shift register; location of decimal point.]
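The three exponent formulas on this slide can be traced with a small sketch. The decimal64 values p = 16 and bias = 398 are assumptions taken from IEEE 754-2008 rather than from the slides, and the function name is hypothetical.

```python
# Assumed decimal64 parameters; the slides only say "64-bit (16 digit) operands".
P = 16       # precision p in digits
BIAS = 398   # decimal64 exponent bias (assumption, from IEEE 754-2008)


def intermediate_exponents(e_a, e_b, sla, p=P, bias=BIAS):
    """Return (PE, IE_IP, IE_SIP) for biased operand exponents e_a and e_b."""
    pe = e_a + e_b - bias   # preferred exponent
    ie_ip = pe + p          # exponent relative to the decimal point in the 2p-digit register
    ie_sip = ie_ip - sla    # each digit of left shift lowers the exponent by one
    return pe, ie_ip, ie_sip


# 123 x 45 with both biased exponents equal to the bias (true exponent 0).
# LZ_A = 13 and LZ_B = 14, so SLA = min(27, 16) = 16.
pe, ie_ip, ie_sip = intermediate_exponents(398, 398, sla=16)
print(pe, ie_ip, ie_sip)  # 398 414 398
```

In this example the product fits in p digits, so the full shift of p digits brings the exponent back to the preferred exponent.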

Intermediate Product Shifting

• Based on leading zero counts of operands
• $SLA$ may be off by one; need guard digit
• $SLA = \min(LZ_A + LZ_B,\ p)$
• Shift right when $IE_{IP} < E_{min}$

[Figure labels: Intermediate Product Register (2p digits); location of decimal point; $LZ_A + LZ_B$; $p$.]
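A short sketch of why the shift amount derived from the operands' leading zero counts can be off by one. It models the 2p-digit register as a digit string and uses a toy precision of p = 4; the function names are hypothetical, and the right-shift case for $IE_{IP} < E_{min}$ is not modeled.

```python
def leading_zeros(coefficient, p):
    """Leading zero digits of a coefficient stored in a p-digit field."""
    digits = str(coefficient).zfill(p)
    return len(digits) - len(digits.lstrip("0"))


def shift_left_amount(a, b, p):
    """SLA = min(LZ_A + LZ_B, p), computed from the operands alone."""
    return min(leading_zeros(a, p) + leading_zeros(b, p), p)


def shifted_intermediate_product(a, b, p):
    """Return (SLA, 2p-digit product after shifting left by SLA digits)."""
    sla = shift_left_amount(a, b, p)
    ip = str(a * b).zfill(2 * p)   # intermediate product, 2p digits
    sip = ip[sla:] + "0" * sla     # only leading zeros are shifted out
    return sla, sip


print(shifted_intermediate_product(99, 99, 4))  # (4, '98010000')
print(shifted_intermediate_product(10, 10, 4))  # (4, '01000000')
```

Both operand pairs give the same SLA, but 10 x 10 has one digit fewer than 99 x 99, so one leading zero remains after the shift. That leftover zero is what the guard digit and the one-digit corrective shift absorb.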

Sticky Bit Generation

• Logically, all bits beyond the round digit must be ORed after left shifting
• $SC = S_{IP} - p - 2$, where 2 is for $g$ and $r$
• Generate sticky bit on-the-fly, ORing one digit at a time while decrementing $SC$
• $SC = \max(0,\ p - (LZ_A + LZ_B))$
  – $S_{IP} - p = ((p - LZ_A) + (p - LZ_B)) - p$
  – Calculate two cycles prior to when needed
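The on-the-fly idea can be sketched as a counter-driven OR over the digits that pass the round digit position, least significant first. This is a behavioral sketch only: the function name is hypothetical, the sticky counter is taken as an input, and the hardware timing (computing $SC$ two cycles early) is not modeled.

```python
def sticky_on_the_fly(digits_lsd_first, sc):
    """OR one digit per step into the sticky bit while SC counts down to zero."""
    sticky = 0
    for digit in digits_lsd_first:
        if sc == 0:
            break                      # remaining digits belong to r, g, or the result
        sticky |= 1 if digit != 0 else 0
        sc -= 1
    return sticky


low_digits = [0, 0, 3, 7]              # digits leaving the accumulator, LSD first
print(sticky_on_the_fly(low_digits, 2))  # 0
print(sticky_on_the_fly(low_digits, 3))  # 1
```

Because the digits are consumed one per iteration, no wide OR tree over the low half of the product is needed at the end.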

Rounding - Scheme

• No rounding overflow... simplifies scheme
• Unique compound adder needed
  – $SIP$ may be in redundant form
  – Require $C_{SIP} + 0$ and $C_{SIP} + 1$; named $C^{+0}$ and $C^{+1}$
• Possible corrective left shift ($cls$) of one digit
  – $S_{IP} = S_A + S_B$ or $S_A + S_B - 1$
  – Adder $p$ digits wide
  – Concatenate $g$ or $g + 1$

Rounding – Scheme Continued

• Three cases based on MSDs of $C^{+0}$ and $C^{+1}$
  – No leading zeros, no corrective left shift
  – Leading zeros, possible corrective left shift
  – Zero followed by all nines
• Logically, select one among the following
  – $C^{+0}$, $C^{+1}$
  – $(C^{+0} \ll 1) \parallel g$, $(C^{+0} \ll 1) \parallel (g + 1)$
  – $(C^{+1} \ll 1) \parallel g$, $(C^{+1} \ll 1) \parallel (g + 1)$
  – Zero, largest finite number, infinity
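A simplified sketch of the three-way selection, under several assumptions that are not in the slides: inputs are already non-redundant digit strings, the rounding mode is round-half-even, and at most one leading zero is present. The options that pair $C^{+1}$ with $g$ come from the redundant (carry-save) form and are not modeled; the special results (zero, largest finite number, infinity) are also omitted. All names are hypothetical.

```python
def round_up(lsd, round_digit, sticky):
    """Round-half-even increment decision (assumed rounding mode)."""
    if round_digit != 5:
        return round_digit > 5
    return sticky or lsd % 2 == 1


def select_rounded(c0, g, r, sb):
    """Pick the rounded p-digit significand; returns (digits, cls_applied).

    c0 is the p-digit string C+0, g and r are the guard and round digits,
    sb is the sticky bit.
    """
    p = len(c0)
    c1 = str(int(c0) + 1).zfill(p)      # C+1 from the compound adder
    if c0[0] != "0":
        # Case 1: no leading zero, no corrective left shift.
        # c1 stays p digits wide because real products never give all nines here.
        return (c1 if round_up(int(c0[-1]), g, sb or r != 0) else c0), False
    # Leading zero: shift left one digit and take g in as the new LSD.
    if not round_up(g, r, sb):
        return c0[1:] + str(g), True    # (C+0 << 1) || g
    if g < 9:
        return c0[1:] + str(g + 1), True  # (C+0 << 1) || (g + 1)
    if c1[0] != "0":
        # Case 3: C+0 was zero followed by all nines; the carry fills the MSD,
        # so C+1 is used without the corrective shift.
        return c1, False
    return c1[1:] + "0", True           # carry out of g propagates into C+1


print(select_rounded("0123", 9, 7, False))  # ('1240', True)
print(select_rounded("0999", 9, 5, True))   # ('1000', False)
```

The point of the sketch is that every outcome is one of a few precomputed candidates, so rounding reduces to control logic driving a multiplexer rather than a second carry-propagate addition.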

Exception Detection & Handling

• Invalid operation
  – sNaN (pass significand of sNaN)
  – 0 × ∞ (produce qNaN with significand 0)
• Overflow (and Inexact)
  – $IE_{IP} - SLA > E_{max}$
  – Increase $SLA$ until all LZs removed
• Underflow (and possibly Inexact)
  – $IE_{IP} - SLA < E_{min}$
  – Decrease $SLA$ until 0, then shift right
• Inexact
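The overflow and underflow adjustments can be sketched as a small function that moves $SLA$ only as far as needed. This is an illustration under assumptions: all exponents are in one common domain, `lz_total` stands for the number of leading zeros available to remove, the toy limits in the example are arbitrary, and the function name is hypothetical. Inexact and invalid-operation signaling are not modeled.

```python
def adjust_shift(ie_ip, sla, lz_total, e_min, e_max):
    """Return (sla, shift_right, overflow) after exponent range checks."""
    shift_right = 0
    if ie_ip - sla > e_max:
        # Too large: shift further left, but never past the leading zeros.
        sla = min(lz_total, ie_ip - e_max)
    elif ie_ip - sla < e_min:
        # Too small: give back left shift first, then shift right.
        sla = max(0, ie_ip - e_min)
        shift_right = max(0, e_min - ie_ip)
    overflow = ie_ip - sla > e_max      # still out of range with all LZs removed
    return sla, shift_right, overflow


# Toy exponent range [0, 100], chosen only for illustration.
print(adjust_shift(110, 4, 12, 0, 100))  # (10, 0, False)
print(adjust_shift(110, 4, 6, 0, 100))   # (6, 0, True)
print(adjust_shift(-2, 4, 4, 0, 100))    # (0, 2, False)
```

The second call shows a true overflow: even after removing every leading zero the exponent stays above the maximum. The third shows the gradual underflow path, where the left shift drops to zero and two digits of right shift remain.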

[Figure: dataflow block diagram. Labels: Intermediate Prod. Reg. (Master/Slave w/ Reset), 5 bits/digit, data in carry-save form with sum (4 bits) and carry (1 bit); B/Intermediate Prod. Reg. (Master/Slave), 4 bits/digit; Left Shifter with SLA input; Shifted Intermediate Prod. Reg. (Master/Slave), 5 bits/digit; Compound Adder (p digits) with outputs $C^{+0}$ and $C^{+1}$; 2:1 Multiplexor; Shifted Non-Redundant Prod. Reg. (Master/Slave); Final Shift Left By One; Round; Final Product Register (Master/Slave), 4 bits/digit; on-the-fly sticky generation with SC, digit from round digit position; guard digit (4 bits); round digit (4 bits); sticky bit (1 bit); location of decimal point; IP; SIP; CP; l, g, r, sb.]

Legend
• SC = Sticky Counter
• SLA = Shift Left Amount
• l, g, r = LSD, guard, and round digit positions
• sb = sticky bit

Note
• Less significant half of intermediate product register and shifted non-redundant product register can be shared

Implementation Highlights

• Leverage operands' LZCs
  – $SC$, $SLA$, and $IE_{SIP}$
• Handle NaNs with minimal overhead
  – No dataflow modification
  – Coerce multiplicand or multiplier to 1
• Support gradual underflow
  – No dataflow modification
  – Simply extend number of iterations
• Simple, control-based rounding scheme
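The NaN handling bullet can be illustrated with a tiny sketch: if one operand is a NaN, the other is replaced by 1 so that the ordinary multiply path reproduces the NaN's significand unchanged. The function name is hypothetical, and giving the first operand priority when both are NaNs is an assumption, not something the slides state.

```python
def coerce_nan_operands(sig_a, a_is_nan, sig_b, b_is_nan):
    """Replace the non-NaN operand's significand by 1 when the other is a NaN."""
    if a_is_nan:
        return sig_a, 1     # assumed priority: first operand's payload wins
    if b_is_nan:
        return 1, sig_b
    return sig_a, sig_b


a, b = coerce_nan_operands(123, True, 45, False)
print(a * b)  # 123, the NaN significand passes through the multiplier unchanged
```

Since the coercion happens on the operands, the datapath itself needs no NaN-specific bypass.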

RTL Model and Verification

• Verilog model for both fixed-point and floating-point multiplier designs
• All rounding modes, NaNs, exceptions
• Over 500,000 random & directed testcases
  – IBM decNumber based
  – IBM Haifa's FPgen (IEEE754R compliance)
  – IBM dectest
• Validated pre- and post-synthesis

Synthesis Results

• 64-bit (16 digit) operands, DPD encoded
• LSI Logic's gflxp 0.11 µm CMOS, 55 ps FO4
• Synopsys Design Compiler
• Results
  – Fixed-point: 119,653 µm², 14.72 FO4s
  – Floating-point: 237,607 µm², 15.45 FO4s
• Critical path
  – Fixed-point: 4:2 compressor (accumulator)
  – Floating-point: 128-bit barrel shifter
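For a quick comparison, the reported figures can be turned into ratios and picoseconds. The sketch below uses only the numbers on this slide; converting FO4 counts to time by multiplying with the 55 ps FO4 delay is the one assumption.

```python
FO4_PS = 55.0  # FO4 delay of the target technology, in picoseconds

# (area in square micrometers, delay in FO4s) as reported on the slide
designs = {
    "fixed-point": (119_653, 14.72),
    "floating-point": (237_607, 15.45),
}

area_ratio = designs["floating-point"][0] / designs["fixed-point"][0]
delay_ratio = designs["floating-point"][1] / designs["fixed-point"][1]
delay_ps = {name: fo4 * FO4_PS for name, (_, fo4) in designs.items()}

print(f"area ratio:  {area_ratio:.2f}x")
print(f"delay ratio: {delay_ratio:.3f}x")
for name, ps in delay_ps.items():
    print(f"{name}: {ps:.2f} ps")
```

The floating-point design comes out at roughly twice the area of the fixed-point design with about 5% more delay.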

Applicability to Parallel Designs

• $IE$ and $IP$ shift generation
• Rounding scheme
• NaN handling
• Exception detection and handling
• On-the-fly sticky bit generation... NO

Sequential vs. Parallel

• Sequential
  – Less area
  – Potentially better cycle time
• Parallel
  – Less latency
  – Higher throughput

Summary

• Extended fixed-point, serial multiplier to support floating-point
• Leveraged operands' LZCs
• Developed an efficient rounding scheme
• Verified RTL and gate-level models
• Presented area and delay numbers for fixed- and floating-point designs
• Discussed applicability to parallel designs

And there you have it!

Long live the decimal system!

Backup Slides

No Rounding Overflow

• If $S_{IP} = 2p - 1$
  – MSD == 0
  – Increment will not cause rounding overflow
• If $S_{IP} = 2p$
  – Then we must have a string of p 9s for an increment to overflow
  – p 9s is greater than maximum product
  – No rounding overflow possible
• Simplifies rounding scheme
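The claim that p nines exceeds the maximum product can be checked numerically. The largest product of two p-digit coefficients is $(10^p - 1)^2 = 10^{2p} - 2 \cdot 10^p + 1$, whose upper p digits end in 8 rather than 9. The helper name below is hypothetical.

```python
def max_product_upper_digits(p):
    """Upper p digits of the largest 2p-digit product of two p-digit values."""
    max_product = (10**p - 1) ** 2
    return str(max_product).zfill(2 * p)[:p]


for p in (1, 2, 4, 16):
    print(p, max_product_upper_digits(p))
# 1 8
# 2 98
# 4 9998
# 16 9999999999999998
```

Since the upper half never reaches all nines, adding one at the least significant position cannot carry out of the most significant digit.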

Decimal Storage Format
