Digital Signal Processing with Field Programmable Gate Arrays 2. Computer Arithmetic

Computer Arithmetic



Page 1: Computer Arithmetic

Digital Signal Processing with Field Programmable Gate Arrays
2. Computer Arithmetic

Page 2: Computer Arithmetic

Survey of number representations

NUMBER SYSTEMS

Fixed-point
    conventional: two's complement (2C), one's complement (1C),
                  diminished-by-one (D1), signed magnitude (SM)
    unconventional: signed digit (SD), residue number system (RNS)

Floating-point
    conventional: IEEE 754
        single precision: 8-bit exponent, 23-bit mantissa
        double precision: 11-bit exponent, 52-bit mantissa
    unconventional: logarithmic number system (LNS), 19-bit Splash II format

Page 3: Computer Arithmetic

Fixed point vs Floating point

                   Advantages               Disadvantages
Fixed point        area & power efficient   limited dynamic range and accuracy
Floating point     high precision           expensive in area and power
                                            (computationally intensive)

Page 4: Computer Arithmetic

Conventional coding of signed binary numbers

Binary 2C 1C D1 SM
(2C = two's complement, 1C = one's complement, D1 = diminished-by-one, SM = signed magnitude)

011 3 3 4 3

010 2 2 3 2

001 1 1 2 1

000 0 0 1 0

111 -1 -0 -1 -3

110 -2 -1 -2 -2

101 -3 -2 -3 -1

100 -4 -3 -4 -0

Page 5: Computer Arithmetic

Which representation is the best? Why?

• Virtually all modern computers operate based on 2's complement representation.

• Why?
  1. Hardware for the most common operations is faster (the most common operation is addition).
  2. The hardware is simpler.

Page 6: Computer Arithmetic

Canonical Signed Digit (CSD)
• Cost of an adder and a subtractor is identical in hardware
• CSD notation uses the digits 1, 0 and -1 to represent numbers
• 7 = 0111 can also be represented as 1 0 0 -1 (i.e., 8 - 1)
• Reduces the number of non-zero bits in a number
• Reduces the size of multipliers
• Useful only for hardwired (constant) multiplications
• Great for communication and signal processing applications (filters, FFT, etc.)
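The recoding rule itself is mechanical. Below is a minimal Python sketch of CSD recoding (the function names `to_csd` and `csd_value` are ours, not from the slides): scanning from the LSB, each odd remainder is assigned digit +1 or -1 so that no two adjacent digits are non-zero.

```python
def to_csd(n):
    """Recode a non-negative integer into canonical signed digit form.

    Returns digits in {-1, 0, 1}, most significant first, with no two
    adjacent non-zero digits (the CSD property).
    """
    assert n >= 0
    digits = []                      # collected least significant first
    while n:
        if n & 1:
            d = 2 - (n & 3)          # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d                   # make the remainder even
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]

def csd_value(digits):
    """Evaluate a most-significant-first CSD digit list."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v
```

For 7 this yields [1, 0, 0, -1], i.e. 8 - 1: two non-zero digits instead of the three in 0111, so a hardwired constant multiplier needs one subtractor instead of two adders.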

Page 7: Computer Arithmetic

Carry-free Adder

x_k y_k             00    01         01          11    0 -1       0 -1        -1 -1
x_k-1 y_k-1         --    neither    at least    --    neither    at least    --
                          is -1      one is -1         is -1      one is -1
c_k (carry)         0     1          0           1     0          -1          -1
u_k (interim sum)   0     -1         1           0     -1         1           0

Every column satisfies x_k + y_k = 2*c_k + u_k, and the final sum digit s_k = u_k + c_k-1 never generates a further carry, so the addition is carry-free.

Page 8: Computer Arithmetic

Logarithmic Number Systems

• Multiplication, division, roots and powers are simple
• Multiplication reduces to addition, division to subtraction, powers to multiplication, roots to division
• Addition and subtraction are more complex
• Major problem: the logarithm and the antilogarithm (the X of log X) must be converted to and from conventional numbers quickly and accurately, yet this conversion always yields an approximate, inexact result. Binary logarithms are therefore useful only in dedicated arithmetic units for special applications (e.g., real-time digital filters) in which very few conversions are required but many multiplications and divisions are performed.
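A toy Python model makes this trade-off concrete (the 12-bit fractional width of the stored logarithm is an assumption for illustration, not from the slides): multiplication is an exact integer addition in the log domain, but converting in and out quantizes the logarithm.

```python
import math

FRAC_BITS = 12                       # assumed fractional bits of the stored log

def to_lns(x):
    """Convert a positive value to its quantized base-2 logarithm."""
    return round(math.log2(x) * (1 << FRAC_BITS))

def from_lns(e):
    """Antilogarithm: convert back to a conventional number (approximate)."""
    return 2.0 ** (e / (1 << FRAC_BITS))

def lns_mul(a, b):
    """Multiplication is just addition of the stored logarithms."""
    return to_lns(a) + to_lns(b)
```

`from_lns(lns_mul(3.0, 7.0))` is close to 21 but not exact: the error comes entirely from the conversions, which is why the LNS pays off only when many multiplies share few conversions.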

Page 9: Computer Arithmetic

Residue Number System
• An integer number system with the inherent, remarkable property that addition, subtraction and multiplication are carry-free.
• Numbers of arbitrary length can be added, subtracted or multiplied in roughly the same time.
• Unfortunately, other arithmetic operations such as division, comparison and sign detection are very complex and slow.
• The representation of fractions is very awkward.
• It is therefore not seriously considered for general-purpose use.
• However, it is attractive for some special-purpose applications such as many types of digital filters, in which the number of additions and multiplications is substantially greater than the number of invocations of magnitude comparison, overflow detection, division, and the like.
• A very interesting representation.
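The carry-free property is easy to demonstrate in Python (the moduli (7, 11, 13) are an arbitrary example choice): each residue channel is updated independently, and the Chinese Remainder Theorem recovers the conventional result in the expensive direction.

```python
from math import prod

MODULI = (7, 11, 13)                 # pairwise coprime; range M = 1001

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    # each digit (residue) is updated independently: no carries propagate
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    """Chinese Remainder Theorem reconstruction (the slow direction)."""
    M = prod(MODULI)
    return sum(ri * (M // mi) * pow(M // mi, -1, mi)
               for ri, mi in zip(r, MODULI)) % M
```

Note that the reconstruction (like comparison and sign detection) needs modular inverses and a full-range sum, matching the slide's point that these operations are the costly ones.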

Page 10: Computer Arithmetic

Adder

• To reduce the carry delay: carry-skip, carry-lookahead, conditional-sum and carry-select adders can speed up addition and can be used with older-generation FPGA families (which do not provide internal fast carry logic).

Page 11: Computer Arithmetic

Pipelined Adders

Page 12: Computer Arithmetic

Modulo Adders

Page 13: Computer Arithmetic

Multiplier

Page 14: Computer Arithmetic

Distributed arithmetic isn't magic.   Let's demystify it:

• Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate that hides the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate, and is well suited to FPGA designs. It can also be extended to other sum functions such as complex multiplies, Fourier transforms and so on.

Page 15: Computer Arithmetic

Distributed arithmetic
• In most of the multiply-accumulate applications in signal processing, one of the multiplicands for each product is a constant. Usually each multiplication uses a different constant.
• Using our most compact multiplier, the scaling accumulator, we can construct a multiple-product-term parallel multiply-accumulate function in a relatively small space if we are willing to accept a serial input. In this case, we feed four parallel scaling accumulators with unique serialized data. Each multiplies that data by a possibly unique constant, and the resulting products are summed in an adder tree as shown below.
• If we stop to consider that the scaling accumulator multiplier is really just a sum of vectors, then it becomes obvious that we can rearrange the circuit.

Page 16: Computer Arithmetic

Distributed arithmetic
• Here, the adder tree combines the 1-bit partial products before they are accumulated by the scaling accumulator. All we have done is rearrange the order in which the 1xN partial products are summed. Instead of individually accumulating each partial product and then summing the results, we postpone the accumulate function until after we have summed all the 1xN partials at a particular bit time. This simple rearrangement of the order of the adds effectively replaces N multiplies followed by an N-input add with a series of N-input adds followed by a multiply. The manipulation directly eliminates N-1 adders in an N-product-term multiply-accumulate function. For larger numbers of product terms, the savings become significant.

Page 17: Computer Arithmetic

Distributed arithmetic
• Further hardware savings are available when the coefficients Cn are constants. If that is true, then the adder tree shown above becomes a Boolean logic function of the 4 serial inputs. The combined 1xN products and adder tree are reduced to a four-input look-up table. The sixteen entries in the table are sums of the constant coefficients for all the possible serial input combinations. The table is made wide enough to accommodate the largest sum without overflow. Negative table values are sign-extended to the width of the table, and the input to the scaling accumulator should be sign-extended to maintain negative sums.
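A bit-level Python model of this LUT trick may help (the coefficient values and the 8-bit unsigned input width are example assumptions): the 16-entry table holds every possible sum of the constant coefficients, and a scaling accumulator weights each bit time.

```python
COEFFS = [3, -1, 4, 2]               # example constant coefficients C0..C3

# 16-entry LUT: entry k is the sum of the coefficients selected by the
# 4 address bits (one bit taken from each serialized input).
LUT = [sum(c for i, c in enumerate(COEFFS) if (k >> i) & 1)
       for k in range(16)]

def da_mac(xs, nbits=8):
    """Distributed-arithmetic sum of products for 4 unsigned serial inputs."""
    acc = 0
    for b in reversed(range(nbits)):          # present bits MSB first
        addr = sum(((x >> b) & 1) << i for i, x in enumerate(xs))
        acc = 2 * acc + LUT[addr]             # shift, then add the bit sum
    return acc
```

`da_mac([10, 20, 30, 40])` equals 3*10 - 1*20 + 4*30 + 2*40 with no multiplier anywhere: one table lookup and one add per bit time.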

Page 18: Computer Arithmetic

Distributed arithmetic
• Obviously the serial inputs limit the performance of such a circuit. As with most hardware applications, we can obtain more performance by using more hardware. In this case, more than one bit sum can be computed at a time by duplicating the LUT and adder tree as shown here. The second bit computed has a different weight than the first, so some shifting is required before the bit sums are combined. In this 2-bit-at-a-time implementation, the odd bits are fed to one LUT and adder tree, while the even bits are simultaneously fed to an identical tree. The odd-bit partials are left-shifted to properly weight the result and added to the even partials before the aggregate is accumulated. Since two bits are taken at a time, the scaling accumulator has to shift the feedback by 2 places.

Page 19: Computer Arithmetic

Distributed arithmetic
• This paralleling scheme can be extended to compute more than two bits at a time. In the extreme case, all input bits can be computed in parallel and then combined in a shifting adder tree. No scaling accumulator is needed in this case, since the output from the adder tree is the entire sum of products. This fully parallel implementation has a data rate that matches the serial clock, which can be greater than 100 MS/s in today's FPGAs.

Page 20: Computer Arithmetic

Distributed arithmetic
• Most often, we have more than 4 product terms to accumulate. Increasing the size of the LUT might look attractive until you consider that the LUT size grows exponentially. Considering the construction of the logic we stuffed into the LUT, it becomes obvious that we can combine the results from several LUTs in an adder tree. The area of the circuit then grows roughly linearly (about 2n-1) when adder trees are used to expand it, rather than the 2^n growth experienced by increasing the LUT size. For FPGAs, the most efficient use of the logic occurs when we use the natural LUT size (usually a 4-LUT, although an 8-LUT would make sense if we were using an 8-input block RAM) for the LUTs and then add the outputs of the LUTs together in an adder tree, as shown below.

Page 21: Computer Arithmetic

Coordinate Rotation Digital Computer (CORDIC)

CORDIC is an iterative algorithm for calculating trig functions including sine, cosine, magnitude and phase. It is particularly suited to hardware implementations because it does not require any multiplies.

• Birth of CORDIC
CORDIC was introduced by Volder in 1959 to calculate trigonometric values like sine, cosine, etc. In 1971, Walther extended the algorithm to calculate hyperbolic, logarithmic and other functions.

• Advantages
The algorithm uses only minimal hardware (adders and shifters) to compute all trigonometric and other function values. It consumes fewer resources than competing techniques, so performance is high. Thus, almost all scientific calculators use the CORDIC algorithm in their calculations.

Page 22: Computer Arithmetic

CORDIC Principle
• Principle
CORDIC works by rotating the coordinate system through a sequence of fixed angles until the residual angle is reduced to zero, using only addition, subtraction and shifts to calculate the function values. Consider a vector C with coordinates (X, Y) that is to be rotated through an angle σ. The new coordinates (X', Y') after rotation are

X' = X cos(σ) - Y sin(σ)
Y' = Y cos(σ) + X sin(σ)

This can be written in tangent form as

X'/cos(σ) = X - Y tan(σ)
Y'/cos(σ) = Y + X tan(σ)

The angle is broken into smaller and smaller pieces such that the tangent of each piece is a power of 2, i.e. tan(σ(i)) = ±1/2^i, and the pre-calculated angles are accumulated into the total angle. The iteration becomes

X(i+1) = t(i) * (X(i) - d(i) * Y(i)/2^i)
Y(i+1) = t(i) * (Y(i) + d(i) * X(i)/2^i)

where d(i) = ±1 is the rotation direction chosen from the sign of the remaining angle, and t(i) = cos(arctan(1/2^i)) with i varying from 0 to n. The product of the t(i) factors converges to a constant after the first few iterations, so it is better to pre-calculate this constant for a larger value of n:

T = cos(arctan(1/2^0)) x cos(arctan(1/2^1)) x ... x cos(arctan(1/2^n))

The value of T converges to 0.60725293500888. Any precision can be used for T, but the accuracy of the computed sine and cosine depends on it, so at least 6 decimal digits of precision are recommended. The angles arctan(1/2^i) can be pre-calculated and stored in an array; this value is then used in the iterations, avoiding the calculation at iteration time.

Page 23: Computer Arithmetic

CORDIC algorithm
• The steps of the CORDIC algorithm are:
1. Get the angle and store it in Angle. Store the pre-calculated arctan(1/2^i) values in an array.
2. Assign X = 0.607252935 (i.e., X = T), Y = 0.
3. In iteration i, find X' = X/2^i and Y' = Y/2^i.
4. If the sign of Angle is positive then
      X = X - Y'
      Y = Y + X'
      Angle = Angle - arctan(1/2^i)
   else (if the sign of Angle is negative)
      X = X + Y'
      Y = Y - X'
      Angle = Angle + arctan(1/2^i)
5. Repeat steps (3) and (4) until Angle approaches 0.
6. Print "Value of cos =" X
7. Print "Value of sin =" Y
8. Exit
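The steps above can be sketched directly in Python (floating point stands in for the fixed-point shifts; the choice of 24 iterations is an assumption for illustration):

```python
import math

N = 24                                              # iteration count (assumed)
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]   # pre-calculated arctan table
T = 1.0
for i in range(N):
    T *= math.cos(ANGLES[i])                        # converges to 0.60725293500888

def cordic_sin_cos(angle):
    """Return (cos, sin) of angle, |angle| <= pi/2, using only add/shift steps."""
    x, y = T, 0.0                    # start from the pre-scaled unit vector
    for i in range(N):
        dx, dy = y * 2.0 ** -i, x * 2.0 ** -i       # the two shifted terms
        if angle >= 0:               # rotate so the residual angle goes to zero
            x, y, angle = x - dx, y + dy, angle - ANGLES[i]
        else:
            x, y, angle = x + dx, y - dy, angle + ANGLES[i]
    return x, y
```

Each loop iteration is one add/subtract per coordinate plus a shift, exactly the hardware budget the slides describe.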

Page 24: Computer Arithmetic

Scaling Accumulator Multipliers
• Parallel by serial algorithm
• Iterative shift-add routine
• N clock cycles to complete
• Very compact design
• Serial input can be MSB or LSB first depending on direction of shift in accumulator
• Parallel output

A scaling accumulator multiplier performs multiplication using an iterative shift-add routine. One input is presented in bit parallel form while the other is in bit serial form. Each bit in the serial input multiplies the parallel input by either 0 or 1. The parallel input is held constant while each bit of the serial input is presented. Note that the one bit multiplication either passes the parallel input unchanged or substitutes zero. The result from each bit is added to an accumulated sum. That sum is shifted one bit before the result of the next bit multiplication is added to it.

For example, 1011001 x 1101 (89 x 13), serial bits LSB first:

1     1011001
0     0000000
1     1011001
1   + 1011001
-------------
  10010000101   (= 1157)

Each row is the parallel input gated by one serial bit; the accumulated sum is shifted one place between rows.
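In Python, the iterative shift-add routine reduces to a two-line loop (here the serial bits are presented MSB first, so the accumulator shifts left; the function name is ours):

```python
def scaling_accumulator_mult(parallel, serial_bits):
    """Parallel-by-serial multiply: one shift and one add per serial bit.

    Each serial bit 'multiplies' the parallel input by 0 or 1; the running
    sum is shifted one place before the next partial product is added.
    Serial bits are presented MSB first.
    """
    acc = 0
    for bit in serial_bits:
        acc = (acc << 1) + (parallel if bit else 0)
    return acc
```

`scaling_accumulator_mult(0b1011001, [1, 1, 0, 1])` reproduces the 89 x 13 example, taking one clock per serial bit.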

Page 25: Computer Arithmetic

Serial by Parallel Booth Multipliers

• Bit serial adds eliminate need for carry chain
• Well suited for FPGAs without fast carry logic
• Serial input LSB first
• Serial output
• Routing is all nearest neighbor except serial input, which is broadcast
• Latency is one bit time

The simple serial-by-parallel Booth multiplier is particularly well suited for bit-serial processors implemented in FPGAs without carry chains, because all of its routing is to nearest neighbors with the exception of the input. The serial input must be sign-extended to a length equal to the sum of the lengths of the serial and parallel inputs to avoid overflow, which means this multiplier takes more clocks to complete than the scaling accumulator version. This is the structure used in the venerable TTL serial-by-parallel multiplier.

Page 26: Computer Arithmetic

Ripple Carry Array Multipliers

• Row ripple form
• Unrolled shift-add algorithm
• Delay is proportional to N

A ripple carry array multiplier (also called row ripple form) is an unrolled embodiment of the classic shift-add multiplication algorithm. The illustration shows the adder structure used to combine all the bit products in a 4x4 multiplier. The bit products are the logical AND of the bits from each input; they are shown in the form x,y in the drawing. The maximum delay is the path from either LSB input to the MSB of the product, and is the same (ignoring routing delays) regardless of the path taken. The delay is approximately 2*n.

This basic structure is simple to implement in FPGAs, but does not make efficient use of the logic in many FPGAs, and is therefore larger and slower than other implementations.

Page 27: Computer Arithmetic

Row Adder Tree Multipliers

• Optimized row ripple form
• Fundamentally same gate count as row ripple form
• Row adders arranged in tree to reduce delay
• Routing more difficult, but workable in most FPGAs
• Delay proportional to log2(N)

Row Adder tree multipliers rearrange the adders of the row ripple multiplier to equalize the number of adders the results from each partial product must pass through. The result uses the same number of adders, but the worst case path is through log2(n) adders instead of through n adders. In strictly combinatorial multipliers, this reduces the delay. For pipelined multipliers, the clock latency is reduced. The tree structure of the routing means some of the individual wires are longer than the row ripple form. As a result a pipelined row ripple multiplier can have a higher throughput in an FPGA (shorter clock cycle) even though the latency is increased.

Page 28: Computer Arithmetic

Carry Save Array Multipliers
• Column ripple form
• Fundamentally same delay and gate count as row ripple form
• Gate-level speed-ups available for ASICs
• Ripple adder can be replaced with faster carry tree adder
• Regular routing pattern

Page 29: Computer Arithmetic

Look-Up Table (LUT) Multipliers
• Complete times table of all possible input combinations
• One address bit for each bit in each input
• Table size grows exponentially
• Very limited use
• Fast - result is just a memory access away

Look-Up Table multipliers are simply a block of memory containing a complete multiplication table of all possible input combinations. The large table sizes needed for even modest input widths make them impractical for FPGAs. The following table shows the contents of a 6-input LUT for a 3-bit by 3-bit multiplication table (row and column headers are the two 3-bit inputs; entries are their 6-bit products).

000 001 010 011 100 101 110 111

000 000000 000000 000000 000000 000000 000000 000000 000000

001 000000 000001 000010 000011 000100 000101 000110 000111

010 000000 000010 000100 000110 001000 001010 001100 001110

011 000000 000011 000110 001001 001100 001111 010010 010101

100 000000 000100 001000 001100 010000 010100 011000 011100

101 000000 000101 001010 001111 010100 011001 011110 100011

110 000000 000110 001100 010010 011000 011110 100100 101010

111 000000 000111 001110 010101 011100 100011 101010 110001

Page 30: Computer Arithmetic

Partial Product LUT Multipliers

• Works like longhand multiplication
• LUT used to obtain products of digits
• Partial products combined with adder tree

Partial Products LUT multipliers use partial product techniques similar to those used in longhand multiplication (like you learned in 3rd grade) to extend the usefulness of LUT multiplication. Consider the longhand multiplication:

      67
   x  54
   -----
      28      (4 x 7)
     240      (4 x 60)
     350      (50 x 7)
  + 3000      (50 x 60)
   -----
    3618

Page 31: Computer Arithmetic

Partial Product LUT Multipliers

• By performing the multiplication one digit at a time and then shifting and summing the individual partial products, the size of the memorized times table is greatly reduced. While this example is decimal, the technique works for any radix. The order in which the partial products are obtained or summed is not important. The proper weighting by shifting must be maintained however.

• The example below shows how this technique is applied in hardware to obtain a 6x6 multiplier using the 3x3 LUT multiplier shown above. The LUT (which performs multiplication of a pair of octal digits) is duplicated so that all of the partial products are obtained simultaneously. The partial products are then shifted as needed and summed together. An adder tree is used to obtain the sum with minimum delay.

• The LUT could be replaced by any other multiplier implementation, since LUT is being used as a multiplier. This gives the insight into how to combine multipliers of an arbitrary size to obtain a larger multiplier.

• The LUT multipliers shown have matched radices (both inputs are octal). The partial products can also have mixed radices on the inputs, provided care is taken to make sure the partial products are shifted properly before summing. Where the partial products are obtained with small LUTs, the most efficient implementation occurs when the LUT is square (i.e., the input radices are the same). For 8-bit LUTs, such as might be found in an Altera 10K FPGA, this means the LUT radix is hexadecimal. For 4-bit LUTs, found in many FPGA logic cells, the ideal radix is 2 bits (this is really the only option for a 4-LUT: a 1-bit input reduces the LUT to an AND gate, and since each LUT cell has 1 output, it can only use one bit on the other input).

• A more compact but slower version is possible by computing the partial products sequentially using one LUT and accumulating the results in a scaling accumulator. Note that in this case, the shifter needs special control to obtain the proper shift for all the partials.
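The 6x6-from-3x3 construction described above can be sketched as follows (the digit-product table plays the role of the 3x3 LUT; the names are ours):

```python
# The 3x3-bit LUT multiplier: a complete times table of octal digits.
TIMES3 = [[a * b for b in range(8)] for a in range(8)]

def mult6x6(a, b):
    """6-bit x 6-bit multiply built from four 3x3 LUT partial products.

    Split each input into octal digits (a = 8*ah + al, b = 8*bh + bl), so
    a*b = 64*ah*bh + 8*(ah*bl + al*bh) + al*bl; each digit product is a
    LUT lookup, and the weighting is done with shifts before the adder tree.
    """
    ah, al = a >> 3, a & 7
    bh, bl = b >> 3, b & 7
    return ((TIMES3[ah][bh] << 6)
            + ((TIMES3[ah][bl] + TIMES3[al][bh]) << 3)
            + TIMES3[al][bl])
```

All four partial products are independent lookups, mirroring the hardware in which the LUT is duplicated so they are obtained simultaneously.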

Page 32: Computer Arithmetic

Computed Partial Product Multipliers

• Partial product optimization for FPGAs having small LUTs
• Fewer partial products decrease depth of adder tree
• 2 x n bit partial products generated by logic rather than LUT
• Smaller and faster than 4-LUT partial product multipliers

A partial product multiplier constructed from the 4 LUTs found in many FPGAs is not very efficient because of the large number of partial products that need to be summed (and the large number of LUTs required). A more efficient multiplier can be made by recognizing that a 2 bit input to a multiplier produces a product 0,1,2 or 3 times the other input. All four of these products are easily generated in one step using just an adder and shifter. A multiplexer controlled by the 2 bit multiplicand selects the appropriate product as shown below. Unlike the LUT solution, there is no restriction on the width of the A input to the partial product. This structure greatly reduces the number of partial products and the depth of the adder tree. Since the 0,1,2 and 3x inputs to the multiplexers for all the partial products are the same, one adder can be shared by all the partial product generators. This structure works well in coarser grained FPGAs like the Xilinx 4K series.

2 x n bit partial product generated with adder and multiplexer

Page 33: Computer Arithmetic

Computed Partial Product Multipliers

• The Xilinx Virtex device includes an extra gate in the carry chain logic that allows a 4 input LUT plus the carry chain to perform a 2xN partial product, thereby achieving twice the density otherwise attainable. In this case, the adder (consisting of the XOR gates and muxes in the carry chain) adds a pair of 1xN partial products obtained with AND gates. The extra AND gate in the carry logic allows you to put AND gates on both inputs of the adder while maintaining a 4 input function.

2 x n bit computed partial product implemented in Xilinx Virtex using special MULTAND gate in carry chain logic

Page 34: Computer Arithmetic

Constant Coefficient Multipliers

• Multiplies input by a constant
• LUT contains custom times table
• Width of constants does not affect depth of adder tree
• All LUT inputs available for multiplicand
• More efficient than full multiplier
• Size is constant regardless of value of constant (assuming equal constant bit widths)

A full multiplier accepts the full range of inputs for each multiplicand. If one of the multiplicands is a constant, then it is far more efficient to construct a times table that only has the column corresponding to the constant value. These are known as constant (K) Coefficient Multipliers or KCM's. The example below multiplies a 5 bit input (values 0 to 31) by a constant 67. Note that with a constant multiplier, all of the LUT inputs are available for the variable multiplicand. This makes the KCM more efficient than a full multiplier (fewer partial products for a given width).

When the LUT does not offer enough inputs to accommodate the desired variable width, several identical LUTs may be combined using the partial products techniques discussed above. In this case, the constant multiplicand is full width, so the partial products will be m x n where m is the number of LUT inputs and n is the width of the constant.

5 bit input * 67

input (rows: low 3 bits, columns: high 2 bits)
        00    01    10    11

000 0 536 1072 1608

001 67 603 1139 1675

010 134 670 1206 1742

011 201 737 1273 1809

100 268 804 1340 1876

101 335 871 1407 1943

110 402 938 1474 2010

111 469 1005 1541 2077
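A Python sketch of the KCM idea (the 10-bit variable width is an assumed example): the table stores only the x67 column of the times table, and wider inputs reuse the same table on 5-bit slices combined with shifts and adds, as the partial product technique describes.

```python
CONSTANT = 67
LUT_BITS = 5
# Custom times table: only the column for the constant coefficient.
KCM_LUT = [CONSTANT * x for x in range(1 << LUT_BITS)]

def kcm_mult(x, width=10):
    """Multiply x by the constant using 5-bit LUT slices plus shifts/adds."""
    total = 0
    for shift in range(0, width, LUT_BITS):
        slice_bits = (x >> shift) & ((1 << LUT_BITS) - 1)
        total += KCM_LUT[slice_bits] << shift
    return total
```

For a 5-bit input only one lookup is needed; `kcm_mult(31)` reads entry 31 of the table (2077), matching the last row of the table above.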

Page 35: Computer Arithmetic

Limited Set LUT Multipliers

• Multiplies input by one of a small set of constants
• Similar to KCM multiplier
• LUT input bit(s) select which constant to use
• Useful in modulators and other signal processing applications

In signal processing, there are often instances where one multiplicand is taken from a small set of constant values. In these cases, the KCM multiplier can be extended so that the LUT contains the times tables for each constant. One or more of the LUT inputs select which constant is used, while the remaining inputs are for the variable multiplicand. The example below is a 6-LUT containing times tables for the constants 67 and 85. One bit of the input selects which times table is used. The remaining inputs are the 5-bit variable multiplicand (values from 0 to 31). Again, the input width can be extended using the partial product techniques discussed previously.

5 bit input * 67 5 bit input * 85

000 001 010 011 100 101 110 111

000 0 536 1072 1608 0 680 1360 2040

001 67 603 1139 1675 85 765 1445 2125

010 134 670 1206 1742 170 850 1530 2210

011 201 737 1273 1809 255 935 1615 2295

100 268 804 1340 1876 340 1020 1700 2380

101 335 871 1407 1943 425 1105 1785 2465

110 402 938 1474 2010 510 1190 1870 2550

111 469 1005 1541 2077 595 1275 1955 2635

Page 36: Computer Arithmetic

Constant Multipliers from Adders

• Adder for each '1' bit in constant
• Subtractor replaces strings of '1' bits using Booth recoding
• Efficiency and size depend on value of constant
• KCM multipliers are usually more efficient for arbitrary constant values

The shift-add multiply algorithm essentially produces m 1xn partial products and sums them together with appropriate shifting. The partial products corresponding to '0' bits in the 1 bit input are zero, and therefore do not have to be included in the sum. If the number of '1' bits in a constant coefficient multiplier is small, then a constant multiplier may be realized with wired shifts and a few adders as shown in the 'times 10' example below.

In cases where there are strings of '1' bits in the constant, adders can be eliminated by using Booth recoding methods with subtractors. The 'times 14' example below illustrates this technique. Note that 14 = 8+4+2 can be expressed as 14 = 16-2, which reduces the number of partial products.

Times 10 (binary 1010, serial digits LSB first; two non-zero partials):

0     0000000
1     1011001
0     0000000
1   + 1011001
-------------
   1101111010

Times 14 (binary 1110; three non-zero partials):

0     0000000
1     1011001
1     1011001
1   + 1011001
-------------
  10011011110

Times 14 recoded as 16 - 2 (digits 0 -1 0 0 1 LSB first; one subtract replaces two adds):

 0     0000000
-1   - 1011001
 0     0000000
 0     0000000
 1   + 1011001
-------------
  10011011110
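As code, each constant multiplier is just wired shifts feeding adders or subtractors (Python stand-ins for the hardware structures above):

```python
def times10(x):
    # 10 = 8 + 2: two wired shifts and a single adder
    return (x << 3) + (x << 1)

def times14_adds(x):
    # 14 = 8 + 4 + 2: the string of '1' bits costs two adders
    return (x << 3) + (x << 2) + (x << 1)

def times14_booth(x):
    # Booth recoding: 14 = 16 - 2, one subtractor replaces the string
    return (x << 4) - (x << 1)
```

The shift amounts are free in hardware (wiring), so the cost is one adder for `times10`, two adders for `times14_adds`, and one subtractor for `times14_booth`.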

Page 37: Computer Arithmetic

Constant Multipliers from Adders

• Combinations of partial products can sometimes also be shifted and added in order to reduce the number of partials, although this may not necessarily reduce the depth of the tree. For example, the 'times 1/3' approximation (85/256 = 0.332) below uses fewer adders than would be necessary if all the partial products were summed directly. Note that the shifts are in the opposite direction to obtain the fractional partial products.

• Clearly, the complexity of a constant multiplier constructed from adders depends on the constant. For an arbitrary constant, the KCM multiplier discussed above is a better choice. For certain 'quick and dirty' scaling applications, this multiplier works nicely.

Page 38: Computer Arithmetic

Wallace Trees
• Optimized column adder tree
• Combines all partial products into 2 vectors (carry and sum)
• Carry and sum outputs combined using a conventional adder
• Delay is log(n)
• Wallace tree multiplier uses a Wallace tree to combine 1 x n partial products
• Irregular routing
• Not optimum in many FPGAs

A Wallace tree is an implementation of an adder tree designed for minimum propagation delay. Rather than completely adding the partial products in pairs like the ripple adder tree does, the Wallace tree sums up all the bits of the same weights in a merged tree. Usually full adders are used, so that 3 equally weighted bits are combined to produce two bits: one (the carry) with weight of n+1 and the other (the sum) with weight n. Each layer of the tree therefore reduces the number of vectors by a factor of 3:2 (Another popular scheme obtains a 4:2 reduction using a different adder style that adds little delay in an ASIC implementation). The tree has as many layers as is necessary to reduce the number of vectors to two (a carry and a sum). A conventional adder is used to combine these to obtain the final product. The structure of the tree is shown below. The red numbers after each full adder in the illustration indicate the bit weights of each signal. For a multiplier, this tree is pruned because the input partial products are shifted by varying amounts.
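The 3:2 reduction step is easy to model in Python on whole addend vectors (a bitwise full adder whose carry vector is shifted left to weight n+1; function names are ours):

```python
def carry_save_add(a, b, c):
    """3:2 compressor: three addends in, a sum vector and a carry vector out.

    Bitwise full adders with no carry propagation between bit positions;
    the carry vector is shifted left one place (weight n+1), so the
    identity a + b + c == s + cy always holds.
    """
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def wallace_sum(addends):
    """Reduce the addends 3:2 per layer, then one conventional final add."""
    vals = list(addends)
    while len(vals) > 2:
        s, cy = carry_save_add(vals[0], vals[1], vals[2])
        vals = vals[3:] + [s, cy]
    return sum(vals)                 # the final carry-propagate adder
```

Each `carry_save_add` layer takes constant time regardless of word width because no carries ripple; only the final conventional add pays the carry-propagation cost.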

Page 39: Computer Arithmetic

Wallace Trees

A section of an 8-input Wallace tree. The Wallace tree combines the 8 partial product inputs into two output vectors corresponding to a sum and a carry. A conventional adder is used to combine these outputs to obtain the complete product.

Page 40: Computer Arithmetic

Wallace Trees
• If you trace the bits in the tree (the tree in the illustration is color coded to help in this regard), you will find that the Wallace tree is a tree of carry-save adders arranged as shown to the left. A carry-save adder consists of full adders like the more familiar ripple adders, but the carry output from each bit is brought out to form a second result vector rather than being wired to the next most significant bit. The carry vector is 'saved' to be combined with the sum later, hence the carry-save moniker.

• A Wallace tree multiplier is one that uses a Wallace tree to combine the partial products from a field of 1xN multipliers (made of AND gates). It turns out that the number of carry-save adders in a Wallace tree multiplier is exactly the same as used in the carry-save version of the array multiplier. The Wallace tree rearranges the wiring, however, so that the partial product bits with the longest delays are wired closer to the root of the tree. This changes the delay characteristic from O(n*n) to O(n*log(n)) at no gate cost. Unfortunately, the nice regular routing of the array multiplier is also replaced with a rat's nest.

Page 41: Computer Arithmetic

Wallace Trees
• A Wallace tree by itself offers no performance advantage over a ripple adder tree
• To the casual observer, it may appear the propagation delay through a ripple adder tree is the carry propagation multiplied by the number of levels, or O(n*log(n)). In fact, the ripple adder tree delay is really only O(n + log(n)) because the delays through the adders' carry chains overlap. This becomes obvious if you consider that the value of a bit can only affect bits of the same or higher significance further down the tree. The worst-case delay is then from the LSB input to the MSB output (and, disregarding routing delays, is the same no matter which path is taken). The depth of the ripple tree is log(n), which is about the same as the depth of the Wallace tree. This means that the ripple carry adder tree's delay characteristic is similar to that of a Wallace tree followed by a ripple adder! If an adder with a faster carry tree scheme is used to sum the Wallace tree outputs, the result is faster than a ripple adder tree. The fast carry tree schemes use more gates than the equivalent ripple carry structure, so the Wallace tree normally winds up being faster than a ripple adder tree, and uses less logic than an adder tree constructed of fast carry tree adders. An ASIC implementation may also benefit from some 'go faster' tricks in carry-save adders, so a Wallace tree combined with a fast adder can offer a significant advantage there.
• A Wallace tree is often slower than a ripple adder tree in an FPGA
• Many FPGAs have a highly optimized ripple carry chain connection. Regular logic connections are several times slower than the optimized carry chain, making it nearly impossible to improve on the performance of the ripple carry adders for reasonable data widths (at least 16 bits). Even in FPGAs without optimized carry chains, the delays caused by the complex routing can overshadow any gains attributed to the Wallace tree structure. For this reason, Wallace trees do not provide any advantage over ripple adder trees in many FPGAs. In fact, due to the irregular routing, they may actually be slower and are certainly more difficult to route.

Page 42: Computer Arithmetic

Booth Recoding
• Booth recoding is a method of reducing the number of partial products to be summed. Booth observed that when strings of '1' bits occur in the multiplicand, the number of partial products can be reduced by using subtraction. For example, the multiplication of 89 by 15 shown below has four 1xN partial products that must be summed. This is equivalent to the subtraction shown in the right panel.

89 x 15 (binary 1111, serial digits LSB first; four add-type partials):

1     1011001
1     1011001
1     1011001
1     1011001
0   + 0000000
-------------
  10100110111

89 x 15 recoded as 16 - 1 (digits -1 0 0 0 1 LSB first; one subtract, one add):

-1   - 1011001
 0     0000000
 0     0000000
 0     0000000
 1   + 1011001
-------------
  10100110111
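Radix-2 Booth recoding can be sketched in Python (the digit rule digit_i = bit_{i-1} - bit_i turns each run of '1' bits into a single -1 at its start and a single +1 just past its end; function names are ours):

```python
def booth_recode(n):
    """Radix-2 Booth recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, least significant first; each run of
    '1' bits becomes one -1 and one +1, zeroing out the run's interior.
    """
    digits, prev = [], 0
    while n or prev:
        bit = n & 1
        digits.append(prev - bit)    # digit_i = bit_{i-1} - bit_i
        prev, n = bit, n >> 1
    return digits

def booth_mult(x, y):
    """Sum the (possibly negative) shifted partial products."""
    return sum(d * (x << i) for i, d in enumerate(booth_recode(y)))
```

`booth_recode(15)` gives [-1, 0, 0, 0, 1], i.e. 16 - 1: the four add-type partials of 89 x 15 collapse to one subtract and one add.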